1 Introduction

Imagine a pedestrian on the sidewalk, and an autonomous car cruising on the nearby road. If the pedestrian stays on the sidewalk and continues walking, they are of no concern to the self-driving car. What if, instead, they start approaching the road, possibly in an attempt to cross it? Any prediction of the pedestrian’s future action and their possible position on or off the road would crucially help the autonomous car avoid a potential incident. Foreseeing the pedestrian’s action label and position even half a second in advance could suffice to avoid a major accident. Awareness of surrounding human actions is therefore essential for the robot car.

Fig. 1.
figure 1

An illustration of the action tube prediction problem using an example in which a “pickup” action is being performed on a sidewalk. Ideally, we want the system to predict the action tube shown in (c) (i.e. the one obtained when 100% of the video has been processed) just by observing 25% of the entire clip (a). We want the tube predictor to predict the action class label (shown in red) alongside the spatial location of the tube. The red shaded bounding boxes denote the detected tube in the observed portion of the input video, whereas the blue bounding boxes represent the future predicted action tube for the unobserved part of the clip. (Color figure online)

We can formalise the problem as follows. We seek to predict both the class label and the future spatial location(s) of an action instance as early as possible, as shown in Fig. 1. In essence, this translates into early spatio-temporal action detection [31], achieved by completing the action instance(s) for the unobserved part of the video. As commonly accepted, action instances are here described by ‘tubes’ formed by linking bounding box detections in time.

In closely related work, Singh et al. [31] perform early label prediction and online action detection jointly. The action class label for an input video is predicted early on by observing only a small portion (a few frames) of it, whilst the system incrementally builds action tubes in an online fashion. In contrast, the proposed approach can predict both the class label of an action and its future location(s) (i.e., the future shape of an action tube). In this work, by ‘prediction’ we refer to the estimation of both an action’s label and its location in future, unobserved video segments. We term ‘detection’ the estimation of action labels/locations in the observed segment of the video up to any given time, i.e., for present and past video frames.

The computer vision community is witnessing a rising interest in problems such as early action label prediction [2, 9, 16, 20, 24, 28, 31, 32, 39, 40, 42], online temporal action detection [6, 23, 33, 39], online spatio-temporal action detection [31, 32, 38], future representation prediction [16, 34] and trajectory prediction [1, 15, 21]. Although all these problems are interesting, and definitely encompass a broad scope of applications, they do not entirely capture the complexity involved in many critical scenarios including, e.g., surgical robotics or autonomous driving. In contrast to [31, 32], which can only perform early label prediction and online action detection, in this work we propose to predict both the future action location and the action label. A number of challenges make this problem particularly hard: the temporal structure of an action is obviously not completely observed; locating human actions is itself a difficult task; the observed part can only provide clues about the future locations. In addition, camera movement can make it even harder to extrapolate an entire tube. We propose to solve these problems by regressing the future locations from the present tube.

The ability to predict action micro-tubes (sets of temporally connected bounding boxes spanning k video frames) from pairs of frames [29] or sets of k frames [10, 13] provides a powerful tool to extend the single frame-based online approach by Singh et al. [31] in order to cope with action location prediction, while retaining its incremental nature. Combining the basic philosophies of [29, 31] has thus the potential to provide an interesting and scalable approach to action prediction.

Briefly, the action micro-tubes network (AMTnet) [29] divides the action tube detection problem into a set of smaller sub-problems. Action ‘micro-tubes’ are produced by a convolutional neural network (a 3D region proposal network) processing two input frames that are \(\varDelta \) apart. Each micro-tube consists of two bounding boxes belonging to the two frames. When the network is applied to consecutive pairs of frames, it produces a set of consecutive micro-tubes which can finally be linked [31] to form complete action tubes. The detections forming a micro-tube can be considered as implicitly linked, hence reducing the number of linking sub-problems. Whereas AMTnet was originally designed to generate micro-tubes using only appearance (RGB) inputs, here we augment it by introducing the feature-level fusion of flow and appearance cues, drastically improving its performance and, as a result, that of TPnet.

Fig. 2.
figure 2

Workflow illustrating the application of TPnet to a test video at a time instant t. The network takes frames \(f_t\) and \(f_{t+\varDelta }\) as input and generates classification scores, the micro-tube (in red) for frames \(f_t\) and \(f_{t+\varDelta }\), and prediction bounding boxes (in blue) for frames \(f_{t-\varDelta _p}\), \(f_{t+\varDelta _f}\) up to \(f_{t+n\varDelta _f}\). All bounding boxes are considered to be linked to the micro-tube. Note that predictions also span the past: a setting called smoothing in the estimation literature. \(\varDelta _p\), \(\varDelta _f\) and n are network parameters that we cross-validate during training. (Color figure online)

Concept: We propose to extend the action micro-tube detection architecture by Saha et al. [29] to produce, at any time t, past (\(\tau <t\)), present, and future (\(\tau >t\)) detection bounding boxes, so that each (extended) micro-tube contains bounding boxes for both observed and not yet observed frames. All bounding boxes, spanning presently observed frames as well as past and future ones (in which case we call them predicted bounding boxes), are considered to be linked, as shown in blue in Fig. 2.

We call this new deep network ‘TPnet’.

Once bounding boxes are regressed, the online tube construction method of Singh et al. [31] can be incrementally applied to the observed part of the video to generate one or more ‘detected’ action tubes at any time instant t.

Further, by virtue of TPnet and online tube construction, the temporally linked micro-tubes forming each currently detected action tube (spanning the observed segment of the video) also contain past and future estimated bounding boxes. As these predicted boxes are implicitly linked to the micro-tubes which compose the presently detected tube, the problem of linking the future bounding boxes to the latter, yielding a whole action tube, is automatically addressed.

The proposed approach provides two main benefits: (i) future bounding box predictions are implicitly linked to the present action tubes; (ii) as the method relies only on two consecutive frames separated by a constant distance \(\varDelta \), it is efficient enough to be applicable to real-time scenarios.

Contributions: In summary, we present a Tube Prediction network (TPnet) which:

  • given a partially observed video, can predict video-long action tubes early, in terms of both their classes and their constituent bounding boxes;

  • demonstrates that training a network to make predictions also helps in improving action detection performance;

  • demonstrates that feature-level fusion works better than late fusion in the context of spatio-temporal action detection.

2 Related Work

Early Label Prediction. Early, online action label prediction has been studied using dynamic bag of words [28], structured SVMs [9], hierarchical representations [20], LSTMs [39] and Fisher vectors [6]. Recently, Yeung et al. [39, 40] have proposed a variant of long short-term memory (LSTM) deep networks for modelling these temporal relations via multiple input and output connections. Kong et al. [16], instead, make use of variational auto-encoders to predict a representation for the whole video and use it to determine the action category for the whole video as early as possible. Probabilistic approaches based on Bayesian networks [24], Conditional Random Fields [17] or Gaussian processes [12] may help in activity anticipation. However, inference in such generative approaches is often expensive. None of these methods address the full online label and spatiotemporal location prediction setting considered here.

Online Action Detection. Soomro et al. [32] have recently proposed an online method which can predict an action’s label and detect its location by observing a relatively small portion of the entire video sequence. They use segmentation to perform online detection via SVM models trained on fixed-length segments of the training videos. Similarly, Singh et al. [31] have extended online action detection to untrimmed videos with the help of an online tube construction algorithm built on top of frame-level action bounding box detections. Behl et al. [3], instead, address online detection with the help of a tracking formulation. However, these approaches [3, 31, 32] only perform action localisation for the observed part of the video and adopt the label predicted for the currently detected tube as the label for the whole video.

To the best of our knowledge, no existing method generates predictions concerning both labels and action tube geometry. Interestingly, Yang et al. [38] use features from current, frame t proposals to ‘anticipate’ region proposal locations in \(t+\varDelta \) and to generate detections at time \(t+\varDelta \), thus failing to take full advantage of the anticipation trick to predict the future spatiotemporal extent of the action tubes.

Advances in action recognition are always going to be helpful for action prediction from a general representation learning point of view. For instance, Gu et al. [8] have recently improved on [13, 25] by plugging in the inflated 3D network proposed by [5] as a base network on multiple frames. Although they use a very strong base network pre-trained on the large “Kinetics” [14] dataset, they do not handle the linking process within the network, as the AVA [8] dataset’s annotations are not temporally linked. Analogously, learning to predict future representations [34] can be useful in general action prediction (cf. e.g. [16]).

Recently, inspired by the record-breaking performance of CNN-based object detectors [22, 26, 27], a number of scholars [3, 7, 25, 30, 31, 35, 37, 41] have tried to extend frame-level object detectors to videos for spatio-temporal action localisation. These approaches, however, fail to tackle spatial and temporal reasoning jointly at the network level, as spatial detection and temporal association are treated as two disjoint problems. More recent works have attempted to address this problem by reducing the amount of linking required with the help of ‘micro-tubes’ [29] or ‘tubelets’ [10, 13] for small sets of frames taken together, where micro-tube boxes from different frames are considered to be linked together. AMTnet [29] by Saha et al. is particularly interesting because of its compact (GPU memory-wise) and flexible nature: it can exploit pairs of successive frames \(\varDelta \) sampling intervals apart, and can also leverage sparse annotations [36]. For these reasons, in this work we build on AMTnet as our base network, improving its feature representation by feature-level fusion of motion and appearance cues.

3 Methodology

In this section, we describe our tube prediction framework for the problem formulation described in Sect. 3.1. Our approach has four main components. Firstly, we tie the future action tube prediction problem (Sect. 3.1) to action micro-tube [29] detection. Secondly, we devise our tube prediction network (TPnet) to predict future bounding boxes along with current micro-tubes, and describe its training procedure in Sect. 3.3. Thirdly, we use TPnet in a sliding window fashion (Sect. 3.4) along the temporal direction, generating micro-tubes and corresponding future predictions. Finally, these are fed to a tube prediction framework (Sect. 3.4) to generate the future of any current action tube being built from micro-tubes.

3.1 Problem Statement

We define an action tube as a connected sequence of detection boxes in time, without interruptions and associated with a same action class c, starting at the first frame \(f_1\) and ending at the last frame \(f_T\) of a trimmed video: \(\mathcal {T}_c = \{ {b}_{1}, ... {b}_{t}, ... {b}_{T}\}\). Tubes are constrained to span the entire video duration, as in [7]. At any time point t, a tube is divided into two parts: one that needs to be detected, \(\mathcal {T}_{c}^{d} = \{ {b}_{1}, ... {b}_{t}\}\), up to \(f_t\), and another that needs to be predicted/estimated, \(\mathcal {T}_{c}^{p} = \{{b}_{t+1}, ... {b}_{T}\}\), from frame \(f_{t+1}\) to \(f_{T}\), along with its class c. The observed part of the video is responsible for generating \(\mathcal {T}_{c}^{d}\) (red in Fig. 1), while we need to estimate the future section of the tube \(\mathcal {T}_{c}^{p}\) (blue in Fig. 1) for the unobserved segment of the video. The first sub-problem, the online detection of \(\mathcal {T}_{c}^{d}\), is explained in Sect. 3.2. The second sub-problem (the estimation of the future tube segment \(\mathcal {T}_{c}^{p}\)) is tackled by a tube prediction network (TPnet, Sect. 3.3) within a novel tube prediction framework (Sect. 3.4).

3.2 From Micro-tubes to Full Action Tubes

Saha et al. [29] introduced micro-tubes in their action micro-tube network (AMTnet), shown in Fig. 3. AMTnet decomposes the problem of detecting \(\mathcal {T}_c\) into a set of smaller problems: detecting micro-tubes \({m}_t = \{b_t,b_{t+\varDelta }\}\) at time t, along with their classification scores for \(C+1\) classes, using two successive frames \(f_t\) and \(f_{t+\varDelta }\) as input (Fig. 3(a)). Subsequently, the detected micro-tubes \(\{{m}_{1} ... {m}_{t-\varDelta }\}\) are linked up in time to form the action tube \(\mathcal {T}_c^d\). Similar to [22], one background class is added to the class list, bringing the number of classes to \(C+1\).

AMTnet employs two parallel CNN streams (Fig. 3(b)), one for each frame, which produce two feature maps (Fig. 3(c)). These feature maps are stacked together into one (Fig. 3(d)). Finally, convolutional heads are applied in a (spatially) sliding window fashion over predefined \(3\times 3\) anchor regions [22], which correspond to P prior [22] or anchor [27] boxes. The convolutional heads produce a \(P\times 8\) output per micro-tube (Fig. 3(f)) and \(P\times (C+1)\) corresponding classification scores (Fig. 3(g)). Each micro-tube has 8 coordinates: 4 for the bounding box \(b_t\) in frame \(f_t\) and 4 for the bounding box \(b_{t+\varDelta }\) in frame \(f_{t+\varDelta }\). As shown in Fig. 3(f), the pair of boxes can be considered as implicitly linked together, hence the name micro-tube.

Fig. 3.
figure 3

Overview of the action micro-tube detection network (AMTnet). As it only predicts micro-tubes and their scores, here we modify it to predict the future locations associated with the given micro-tubes, as shown in Fig. 2.

Originally, Saha et al. [29] employed FasterRCNN [27] as base detection architecture. Here, however, we switch to Single Shot Detector (SSD) [22] as a base detector for efficiency reasons. Singh et al. [31] used SSD to propose an online and real-time action tube generation algorithm, while Kalogeiton et al. [13] adapted SSD to detect micro-tubes (or, in their terminology, ‘tubelets’) k frames long.

More importantly, we make two essential changes to AMTnet. Firstly, we enhance its feature representation power by fusing appearance features (based on RGB frames) and flow features (based on optical flow) at the feature level (see the fusion step shown in Fig. 4), unlike the late fusion approach of [13, 31]. Note that the original AMTnet framework does not make use of optical flow at all. We will show that feature-level fusion dramatically improves its performance. Secondly, the AMTnet-based tube detection framework proposed in [29] is offline, as micro-tube linking is done recursively in an offline fashion [7]. Similar to Kalogeiton et al. [13], we adapt the online linking method of [31] to link micro-tubes into a tube \(\mathcal {T}_{c}^{d}\).
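As an illustration of this fusion step, consider the following minimal PyTorch sketch (the function and argument names are ours and hypothetical, not the original implementation), which assumes the appearance and flow streams output feature maps of identical shape:

import torch

def fuse_features(app_feat: torch.Tensor,
                  flow_feat: torch.Tensor,
                  mode: str = "sum") -> torch.Tensor:
    """Feature-level fusion of appearance and optical-flow feature maps.

    Both tensors are assumed to have shape (N, C, H, W), coming from the two
    parallel CNN streams; 'sum' adds them element-wise, while 'concat' stacks
    them along the channel dimension (doubling C for the subsequent heads).
    """
    if mode == "sum":
        return app_feat + flow_feat
    if mode == "concat":
        return torch.cat([app_feat, flow_feat], dim=1)
    raise ValueError(f"unknown fusion mode: {mode}")

Sum fusion keeps the channel count unchanged, which is why it is slightly lighter on GPU memory than concatenation (see Sect. 4.1).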

Micro-tube Linking Details: Let \(B_t\) be the set of detection bounding boxes from frame \(f_t\), and \(B_{t+1}\) the corresponding set from \(f_{t+1}\), generated by a frame-level detector. Singh et al. [31] associate boxes in \(B_t\) to boxes in \(B_{t+1}\), whereas, in our case, we need to link micro-tubes \(m_t \in M_t \doteq B_t^{1} \times B_{t+\varDelta }^{2}\) from a pair of frames \(\{f_t,f_{t+\varDelta }\}\) to micro-tubes \(m_{t+\varDelta } \in M_{t+\varDelta } \doteq B_{t+\varDelta }^1 \times B_{t+2\varDelta }^2\) from the next set of frames \(\{f_{t+\varDelta },f_{t+2\varDelta }\}\). This happens by associating elements of \(B_{t+\varDelta }^2\), coming from \(M_t\), with elements of \(B_{t+\varDelta }^1\), coming from \(M_{t+\varDelta }\). Interestingly, the latter is a relatively easier sub-problem, as all such detections are generated from the same frame, unlike the across-frame association problem considered in [31]. The association is achieved based on Intersection over Union (IoU) and class score, as the tubes are built separately for each class in a multi-label scenario. For more details, we refer the reader to [31].

Since we adopt the online linking framework of Singh et al. [31], we follow most of their linking settings, e.g. linking is done for every class separately and the non-maximum suppression threshold is set to 0.45. As shown in Fig. 5(a) to (b), the last box of the first micro-tube (red) is linked to the first box of the next micro-tube (red). So, the first set of micro-tubes is produced at \(f_1\), the following one at \(f_{\varDelta }\), the one after that at \(f_{2\varDelta }\), and so on. As a result, the last micro-tube is generated at \(f_{t-\varDelta }\) to cover the observable video duration up to time t. Finally, we solve the association problem as described above, and as sketched below.
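To make the linking procedure concrete, the following is a minimal single-instance sketch of the association and chaining just described (greedy matching based on IoU plus class score; the actual algorithm of [31] additionally handles multiple instances, per-class tubes and non-maximum suppression):

def iou(a, b):
    """IoU between two boxes [x1, y1, x2, y2]."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-9)

def link_micro_tubes(micro_tube_steps):
    """Chain consecutive micro-tubes into one action tube (single instance).

    micro_tube_steps[k] is the list of candidate micro-tubes produced at the
    k-th step; each candidate is (box_first, box_second, score), with the two
    boxes Delta frames apart. The last box of the current tube and the first
    box of the next step's candidates lie on the same frame, so matching them
    is the easier, same-frame association discussed above.
    """
    first = max(micro_tube_steps[0], key=lambda m: m[2])
    tube = [first[0], first[1]]
    for candidates in micro_tube_steps[1:]:
        # pick the candidate whose first box best matches the tube's last box,
        # combining overlap and class score (one possible scoring choice)
        best = max(candidates, key=lambda m: iou(tube[-1], m[0]) + m[2])
        tube.append(best[1])  # keep the already stored box for the shared frame
    return tube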

3.3 Training the Tube Prediction Network (TPnet)

AMTnet allows us to detect the current tube \(\mathcal {T}_{c}^{d}\) by generating a set of successive micro-tubes \(\{{m}_{1} ... {m}_{t-\varDelta }\}\), where \({m}_{t-\varDelta } = \{b_{t-\varDelta },b_t\}\). However, our aim is to predict the future section \(\mathcal {T}_{c}^{p}\) of the tube using the micro-tubes linked up to time t.

To address this problem, we propose a tube prediction framework aimed at simultaneously estimating a micro-tube \({m}_t\), a set \({z}_t = \{{b_{t-\varDelta _p}, b_{t+\varDelta _f}, ... b_{t+n\varDelta _f}}\}\) of past and future detections, and the classification scores for the \(C+1\) classes. \(\varDelta _p\) measures how far into the past we look, whereas \(\varDelta _f\) is the future step size and n is the number of future steps. This is performed by a new Tube Prediction network (TPnet).

Fig. 4.
figure 4

Overview of the tube prediction network (TPnet) architecture at training time.

The underlying architecture of TPnet is shown in Fig. 4. TPnet takes two successive frames from time t and \(t+\varDelta \) as input. The two input frames are fed to two parallel CNN streams, one for appearance and one for optical flow. The resulting feature maps are fused together, either by concatenating or by element-wise summing the given feature maps. Finally, three types of convolutional output heads are used for P prior boxes as shown in Fig. 4. The first one produces the \(P\times (C+1)\) classification outputs; the second one regresses the \(P\times 8\) coordinates of the micro-tubes, as in AMTnet; the last one regresses \(P\times (4(1+n))\) coordinates, where 4 coordinates correspond to the frame at \(t-\varDelta _p\), and the remaining 4n are associated with the n future steps. The training procedure of the new architecture is illustrated below.
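A schematic PyTorch sketch of these three output heads is given below (the module name, the use of \(3\times 3\) convolutions and the single-scale view are our simplifying assumptions; the actual heads follow the SSD-style design of [22] over multiple feature maps):

import torch
import torch.nn as nn

class TPnetHeads(nn.Module):
    """Three convolutional output heads applied to the fused feature map.

    For P prior boxes per spatial location, the heads output per location:
      - classification:    P * (C + 1) scores
      - micro-tube:        P * 8 coordinates (boxes b_t and b_{t+Delta})
      - past/future boxes: P * 4 * (1 + n) coordinates
                           (4 for frame t - Delta_p, 4n for the n future steps)
    """
    def __init__(self, in_channels, num_priors, num_classes, n):
        super().__init__()
        P, C = num_priors, num_classes
        self.cls_head = nn.Conv2d(in_channels, P * (C + 1), kernel_size=3, padding=1)
        self.microtube_head = nn.Conv2d(in_channels, P * 8, kernel_size=3, padding=1)
        self.pred_head = nn.Conv2d(in_channels, P * 4 * (1 + n), kernel_size=3, padding=1)

    def forward(self, fused_feat):
        return (self.cls_head(fused_feat),
                self.microtube_head(fused_feat),
                self.pred_head(fused_feat))

For instance, with C = 21 action classes (as in J-HMDB-21) and n = 3 future steps, the third head outputs 4 × (1 + 3) = 16 coordinates per prior box.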

Multi-task Learning. TPnet is designed to strive for three objectives, for each prior box p. The first task (i) is to classify the P prior boxes; the second task (ii) is to regress the coordinates of the micro-tubes; the last (iii) is to regress the coordinates of the past and future detections associated with each micro-tube.

Given a set of P anchor boxes and the respective outputs, we compute a loss following the training objective of SSD [22]. Let \(x_{i,j}^c \in \{0,1\}\) be the indicator for matching the i-th prior box to the j-th ground truth box of category c. We use the bipartite matching procedure described in [22] for matching the ground truth micro-tubes \(G = \{g_t,g_{t+\varDelta }\}\) to the prior boxes, where \(g_t\) is a ground truth box at time t. The overlap between a prior box p and a micro-tube G is computed as the mean IoU between p and the ground truth boxes in G. A match is defined as positive (\(x_{i,j}^c = 1\)) if the overlap is greater than or equal to 0.5.
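In code, the positive-match criterion for a prior box can be sketched as follows (a simplified view; the full bipartite matching of [22] also guarantees that every ground truth receives at least one match):

def box_iou(a, b):
    """IoU of two boxes [x1, y1, x2, y2]."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-9)

def is_positive_match(prior, g_t, g_t_delta, thresh=0.5):
    """A prior box p is a positive match (x = 1) for a ground-truth micro-tube
    G = {g_t, g_{t+Delta}} when its mean IoU with the two boxes is >= thresh."""
    return 0.5 * (box_iou(prior, g_t) + box_iou(prior, g_t_delta)) >= thresh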

The overall loss function \(\mathcal {L}\) is the following weighted sum of classification loss (\(\mathcal {L}_{cls}\)), micro-tube regression loss (\(\mathcal {L}_{reg}\)) and prediction loss (\(\mathcal {L}_{pred}\)):

$$\begin{aligned} \mathcal {L}(x,c,m,G,z,Y) = \frac{1}{N} \big ( \mathcal {L}_{cls}(x,c) + \alpha \mathcal {L}_{reg}(x,m,G) + \beta \mathcal {L}_{pred}(x,z,Y) \big ), \end{aligned}$$
(1)

where N is the number of matches, c is the ground truth class, m is the predicted micro-tube, G is the ground truth micro-tube, z assembles the predictions for the future and the past, and Y is the ground truth of future and past bounding boxes associated with the ground truth micro-tube G. The values of \(\alpha \) and \(\beta \) are both set to 1 in all of our experiments: different values might result in better performance.

The classification loss \(\mathcal {L}_{cls}\) is a softmax cross-entropy loss; a hard negative mining strategy is also employed, as proposed in [22]. The micro-tube loss \(\mathcal {L}_{reg}\) is a Smooth L1 loss [27] between the predicted (m) and the ground truth (G) micro-tube. Similarly, the prediction loss \(\mathcal {L}_{pred}\) is a Smooth L1 loss between the predicted boxes (z) and the ground truth boxes (Y). As in [22, 27], we regress offsets with respect to the coordinates of the prior box p matched to G, for both m and z. We use the same offset encoding scheme as [22].
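The overall objective can be sketched in PyTorch as follows (a minimal version that assumes matching and offset encoding have already been applied to produce the targets; hard negative mining is omitted):

import torch
import torch.nn.functional as F

def tpnet_loss(cls_logits, microtube_pred, future_pred,
               cls_targets, microtube_targets, future_targets,
               pos_mask, alpha=1.0, beta=1.0):
    """Weighted sum of classification, micro-tube regression and prediction
    losses, normalised by the number of positive matches N.

    cls_logits:       (B, P, C + 1) raw class scores for all prior boxes
    microtube_pred:   (B, P, 8) encoded offsets for {b_t, b_{t+Delta}}
    future_pred:      (B, P, 4 * (1 + n)) encoded offsets for past/future boxes
    *_targets:        matching ground-truth targets (same shapes)
    pos_mask:         (B, P) boolean mask of positive matches
    """
    N = pos_mask.sum().clamp(min=1).float()

    # classification over all priors (hard negative mining omitted here)
    l_cls = F.cross_entropy(cls_logits.flatten(0, 1), cls_targets.flatten(),
                            reduction="sum")

    # regression losses over positive matches only (Smooth L1, as in [27])
    l_reg = F.smooth_l1_loss(microtube_pred[pos_mask], microtube_targets[pos_mask],
                             reduction="sum")
    l_pred = F.smooth_l1_loss(future_pred[pos_mask], future_targets[pos_mask],
                              reduction="sum")

    return (l_cls + alpha * l_reg + beta * l_pred) / N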

Fig. 5.
figure 5

Overview of future tube (\(\mathcal {T}_{c}^{p}\)) prediction using the predictions that are linked to micro-tubes. The first row (a) shows two output micro-tubes in light red and red, and their corresponding future predictions in light blue and blue. In row (b) the two micro-tubes are linked together, after which they are shown in the same colour (red). By induction on the previous step, row (c) shows that the predictions associated with the two micro-tubes are linked together as well, hence forming one single tube. The observed segment is shown in red, while the predicted segment for the part of the video yet to be observed is shown in blue. (Color figure online)

3.4 Tube Prediction Framework

TPnet at test time is shown in Fig. 2. As in the training setting, it observes only two frames that are \(\varDelta \) apart at any time point t. The outputs of TPnet at any time t form a micro-tube together with its associated bounding boxes, \(\{m_t = \{b_t, b_{t+\varDelta }\}; z_t = \{b_{t-\varDelta _p},b_{t+\varDelta _f},...,b_{t+n\varDelta _f}\}\}\), all of which are considered as linked together.

As explained in Sect. 3.2, given a set of micro-tubes \(\{{m}_{1} ... {m}_{t-\varDelta }\}\) we can construct \(\mathcal {T}_{c}^{d}\) by online linking [31] of the micro-tubes. As a result, we can use the predictions for \(t+\varDelta _f\) up to \(t+n\varDelta _f\) to generate the future of \(\mathcal {T}_{c}^{d}\), thus extending it further into the future as shown in Fig. 5. More specifically, as indicated in Fig. 5(a), a micro-tube at \(t-2\varDelta \) is composed of \(n+2\) bounding boxes (\(\{b_{t-2\varDelta },b_{t-\varDelta }, b_{t-2\varDelta +\varDelta _f}, ... b_{t-2\varDelta +n\varDelta _f}\}\)) linked together. The last micro-tube is generated at \(t-\varDelta \). In the same fashion, putting together the predictions associated with all the past micro-tubes (\(\{{m}_{1} ... {m}_{t-\varDelta }\}\)) yields a set of linked future bounding boxes (\(\{b_{t+1}, ... , b_{t-\varDelta +n\varDelta _f}\}\)) for the current action tube \(\mathcal {T}_{c}^{d}\), thus outputting a part of the desired future \(\mathcal {T}_{c}^{p}\).

Now, we can generate the future tube \(\mathcal {T}_{c}^{p}\) from the set of linked future bounding boxes (\(\{b_{t+1}, ... b_{t-\varDelta +\varDelta _f},... b_{t-\varDelta +n\varDelta _f}\}\)), covering frames \(t+1\) to \(t-\varDelta +n\varDelta _f\), followed by simple linear extrapolation of the bounding boxes from \(t-\varDelta + n\varDelta _f\) to T. Linear extrapolation is performed based on the average velocity of each coordinate over the last 5 frames; predictions falling outside the image are clipped to the image edges.
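This extrapolation step can be sketched as follows (the box representation, the 5-frame velocity window and the clipping convention follow the description above; function and variable names are ours):

import numpy as np

def extrapolate_boxes(recent_boxes, num_future, img_w, img_h):
    """Linearly extrapolate future boxes from the last few predicted ones.

    recent_boxes: (k, 4) array of the last k boxes [x1, y1, x2, y2]
                  (k = 5 here); the average per-frame velocity of each
                  coordinate is used as a constant motion model.
    Returns an array of shape (num_future, 4), clipped to the image edges.
    """
    recent_boxes = np.asarray(recent_boxes, dtype=np.float64)
    velocity = np.diff(recent_boxes, axis=0).mean(axis=0)   # average velocity per coordinate
    steps = np.arange(1, num_future + 1).reshape(-1, 1)
    future = recent_boxes[-1] + steps * velocity
    future[:, [0, 2]] = future[:, [0, 2]].clip(0, img_w - 1)  # clip x coordinates
    future[:, [1, 3]] = future[:, [1, 3]].clip(0, img_h - 1)  # clip y coordinates
    return future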

4 Experiments

We test our action tube prediction framework (Sect. 3) on four challenging problems: (i) action localisation (Sect. 4.1), (ii) early action prediction (Sect. 4.2), (iii) online action localisation (Sect. 4.2), and (iv) future action tube prediction (Sect. 4.3). Finally, evidence of real-time capability is quantitatively demonstrated in Sect. 4.4.

J-HMDB-21. We evaluate our model on the J-HMDB-21 [11] benchmark. J-HMDB-21 [11] is a subset of the HMDB-51 dataset [19] with 21 action categories and 928 videos, each containing a single action instance and trimmed to the action’s duration. It contains atomic actions which are 20–40 frames long. Although the videos are of short duration (at most 40 frames), we consider this dataset because all tubes in a video belong to the same class, which makes it a good starting point for the action prediction task.

Evaluation Metrics. We now define the evaluation metrics used in this paper. (i) We use the standard mean average precision (mAP) metric to evaluate detection performance when the whole video is observed.

(ii) The early label prediction task is evaluated by video classification accuracy [31, 32], measured when as little as 10% of the video frames have been observed.

(iii) Online action localisation (Sect. 4.2) follows the experimental setup of [31] and uses mAP (mean average precision) as the metric for online action detection, i.e. it evaluates the present tube (\(\mathcal {T}_{c}^{d}\)) built in an online fashion.

(iv) Future tube prediction is a new task; we propose to evaluate its performance in two ways. Firstly, we evaluate the quality of the whole tube prediction, from the start of the video to its end, when as little as 10% of the video has been observed. The entire predicted tube (obtained by observing only a small portion (%) of the video) is compared against the ground truth tube for the whole video. Based on the detection threshold we can compute the mean average precision for the complete tubes; we call this metric completion-mAP (c-mAP). Secondly, we measure how well the future predicted part of the tube localises. In this measure, we compare the predicted tube (\(\mathcal {T}_{c}^{p}\)) with the corresponding ground truth future tube segment. Given the ground truth and the predicted future tubes, we can compute the mean average precision for the predicted tubes; we call this metric prediction-mAP (p-mAP). Both metrics rely on a spatio-temporal overlap between tubes, sketched below.

We report the performance of the last three tasks (i.e. tasks ii to iv) as a function of the video observation percentage, i.e., the portion (%) of the entire video observed.
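Both metrics threshold a spatio-temporal overlap between a candidate tube and a ground-truth tube at \(\delta \). A sketch of one common way to compute such an overlap for tubes spanning the same frames (mean per-frame IoU) is given below; it is meant only to illustrate the thresholding, not to reproduce the exact evaluation code:

import numpy as np

def tube_iou(pred_tube, gt_tube):
    """Spatio-temporal overlap between two tubes spanning the same frames,
    computed as the mean per-frame IoU. Both inputs are arrays of shape
    (T, 4) with boxes [x1, y1, x2, y2]."""
    pred_tube = np.asarray(pred_tube, dtype=np.float64)
    gt_tube = np.asarray(gt_tube, dtype=np.float64)
    ix1 = np.maximum(pred_tube[:, 0], gt_tube[:, 0])
    iy1 = np.maximum(pred_tube[:, 1], gt_tube[:, 1])
    ix2 = np.minimum(pred_tube[:, 2], gt_tube[:, 2])
    iy2 = np.minimum(pred_tube[:, 3], gt_tube[:, 3])
    inter = np.clip(ix2 - ix1, 0, None) * np.clip(iy2 - iy1, 0, None)
    area_p = (pred_tube[:, 2] - pred_tube[:, 0]) * (pred_tube[:, 3] - pred_tube[:, 1])
    area_g = (gt_tube[:, 2] - gt_tube[:, 0]) * (gt_tube[:, 3] - gt_tube[:, 1])
    frame_iou = inter / (area_p + area_g - inter + 1e-9)
    return float(frame_iou.mean())

A predicted tube then counts as a true positive for c-mAP (or p-mAP, computed on the future segment only) if its class is correct and its tube overlap is at least \(\delta \) (e.g. \(\delta = 0.5\)).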

Baseline. We modified AMTnet to fuse flow and appearance features (Sect. 3.2) and treat it as the baseline for all of our tasks. Firstly, we show how feature fusion helps AMTnet in Sect. 4.1, and compare it with other action detection methods along with our TPnet. Secondly, in Sect. 4.3, we linearly extrapolate the detections from AMTnet to construct future tubes, and use this as a baseline for the tube prediction task.

Implementation Details. We train all of our networks with the same set of hyper-parameters to ensure a fair comparison and consistency, including TPnet and AMTnet. We use an initial learning rate of 0.0005, which drops by a factor of 10 after 5K and 7K iterations. All the networks are trained for up to 10K iterations. We implemented AMTnet using PyTorch (https://pytorch.org/). We initialise the AMTnet and TPnet models using an SSD network pretrained on the J-HMDB-21 dataset on the respective train splits. The SSD network training is itself initialised using an ImageNet-trained VGG network. For the optical flow images, we use the optical flow algorithm of Brox et al. [4]. The optical flow output is stored as a three-channel image: two channels hold the flow vector and the third channel holds its magnitude.
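The packing of the optical flow into a three-channel image can be sketched as follows (scaling and quantisation details, which vary between implementations, are omitted):

import numpy as np

def flow_to_three_channels(flow):
    """Pack a dense optical flow field of shape (H, W, 2) into a
    three-channel image: channels 0-1 hold the flow vector (u, v),
    channel 2 holds its magnitude."""
    u, v = flow[..., 0], flow[..., 1]
    magnitude = np.sqrt(u ** 2 + v ** 2)
    return np.stack([u, v, magnitude], axis=-1)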

Table 1. Action localisation results on the J-HMDB-21 dataset. The table is divided into four parts. The first part lists approaches which take a single frame as input; the second part presents approaches which take multiple frames as input; the third part compares different strategies for our feature-level fusion (based on AMTnet); lastly, we report the detection performance of our TPnet when ignoring the future and past predictions and only using the detected micro-tubes to produce the final action tubes.

TPnet\({}_{abc}\). The training parameters of our TPnet define the name of each setting in which the tube prediction network is used. The name TPnet\({}_{abc}\) denotes our TPnet with \(a = \varDelta _p\), \(b = \varDelta _f\) and \(c = n\); if \(\varDelta _p\) is set to 0, the network does not learn to predict past bounding boxes. In all of our settings, we use \(\varDelta =1\).

Fig. 6.
figure 6

Early label prediction results (video-level label prediction accuracy) on J-HMDB-21 dataset in sub-figure (a). Online action detection results (mAP with detection threshold \(\delta = 0.5\)) on J-HMDB-21 dataset are shown in sub-figure (b). TPnet\({}_{abc}\) represents our TPnet where \(a = \varDelta _p\), \(b = \varDelta _f\) and \(c = n\).

4.1 Action Localisation Performance

Table 1 shows the traditional action localisation results for whole action tube detection in the videos of the J-HMDB-21 dataset.

Feature-level fusion shows a remarkable improvement over the late fusion scheme in AMTnet (Table 1): at detection threshold \(\delta =0.75\) the gain is \(10\%\), which allows it to surpass the performance of ACT [13], even though ACT relies on a set of 6 frames whereas AMTnet uses only 2 successive frames as input. Looking at the average-mAP (\(\delta =0.5:0.95\)), we can see that the fused model improves by almost \(8\%\) over the single-frame SSD model of Singh et al. [31]. Concatenation and sum fusion perform almost identically for AMTnet. Sum fusion is slightly less memory-intensive on the GPU than concatenation; as a result, we use sum fusion in our TPnet.

TPnet for detection is shown in the last part of Table 1, where we only use the micro-tubes detected by TPnet to construct the action tubes (Sect. 3.2). We train TPnet to predict future and past (i.e. when \(\varDelta _p>0\)) as well as present micro-tubes. We believe that predicting bounding boxes for both the past and the future video segments acts as a regulariser and helps improve the representation of the whole network, thus improving detection performance (Table 1, TPnet\({}_{051}\) and TPnet\({}_{451}\)). However, adding extra prediction tasks does not always help: when the network is asked to predict the far future, as in TPnet\({}_{053}\) and TPnet\({}_{453}\), we observe a drop in detection performance. We think there are two possible reasons for this: (i) the network might start to focus more on the prediction task, and (ii) videos in J-HMDB-21 are short, and the number of training samples decreases drastically (19K for TPnet\({}_{051}\) vs 10K for TPnet\({}_{453}\)) because we cannot use frames near the end of a video as training samples when a ground truth bounding box 15 frames into the future is required (\(\varDelta _f=5\) and \(n=3\) for TPnet\({}_{053}\)). However, in Sect. 4.3 we show that the TPnet\({}_{053}\) model is the best at predicting the future very early.

4.2 Early Label Prediction and Online Localisation

Figure 6(a) and (b) show the early prediction and online detection capabilities of Singh et al. [31], AMTnet-Feature Fusion-sum and our TPnet.

The method of Soomro et al. [32] also performs early label prediction on J-HMDB-21; however, its performance is much lower, and including it would skew the plot (Fig. 6(a)), so we omit it from the figure. For instance, by observing only the initial \(10\%\) of the videos in J-HMDB-21, TPnet\({}_{453}\) is able to achieve a prediction accuracy of \(58\%\), compared to \(48\%\) by Singh et al. [31] and \(5\%\) by Soomro et al. [32]; this is in fact higher than the \(43\%\) accuracy achieved by [32] after observing the entire video. As more and more of the video is observed, all methods improve, but TPnet\({}_{451}\) shows the largest gain, whereas TPnet\({}_{053}\) shows the smallest gain among the TPnet settings shown; this is in line with the action localisation performance discussed in Sect. 4.1. We can observe similar trends in the online action localisation performance shown in Fig. 6(b). To reiterate, TPnet\({}_{053}\) does not get to see training samples from the end portion of the videos, as it needs a ground truth bounding box from 15 frames ahead. So, the last frame it sees of any training video is \(T-15\), which is almost half the length of the longest video (40 frames) in J-HMDB-21. This effect is magnified when online localisation performance is measured at \(\delta =0.75\); we provide evidence of this in the supplementary material.

Fig. 7.
figure 7

Future action tube prediction results (prediction-mAP (p-mAP)) for predicting the tube in the unobserved part of the video are shown in sub-figure (a). Action tube prediction results (completion-mAP (c-mAP)) for predicting video-long tubes as early as possible on the J-HMDB-21 dataset are shown in sub-figure (b). We use p-mAP (a) and c-mAP (b) with detection threshold \(\delta = 0.5\) as evaluation metrics. TPnet\({}_{abc}\) represents our TPnet where \(a = \varDelta _p\), \(b = \varDelta _f\) and \(c = n\).

4.3 Future Action Tube Prediction

The main task of this paper is to predict the future of action tubes. We evaluate it using the two newly proposed metrics (p-mAP and c-mAP), explained at the start of Sect. 4. Results are shown in Fig. 7: future tube prediction (Fig. 7(a)) is evaluated with the p-mAP metric, and tube completion (Fig. 7(b)) with the c-mAP metric.

Although TPnet\({}_{053}\) is the worst TPnet setting for early label prediction (Fig. 6(a)), online detection (Fig. 6(b)) and action tube detection (Table 1), as it predicts furthest into the future (i.e. 15 frames away from the present time) it is the best model for early future tube prediction (Fig. 7(a)). However, it does not improve as much as the other settings as more and more frames are observed, owing to the reduction in the number of training samples. On the other hand, TPnet\({}_{451}\) shows a large improvement compared to TPnet\({}_{051}\) as more and more of the video is observed in the tube completion task (Fig. 7(b)), which strengthens our argument that predicting not only the future but also the past is useful to achieve more regularised predictions.

Comparison with the Baseline. As explained above, we use AMTnet as a baseline, and its results can be seen in all the plots and in Table 1. We can observe that our TPnet performs better than AMTnet in almost all cases; in particular, in our target task of early future prediction (Fig. 7(a)), TPnet\({}_{043}\) shows an almost \(4\%\) improvement in p-mAP (at \(10\%\) video observation) over AMTnet.

Discussion. Predicting further into the future is essential to produce meaningful predictions (as seen with TPnet\({}_{053}\)); at the same time, predicting the past helps improve overall tube completion performance. One reason for this behaviour could be that J-HMDB-21 tubes are short (at most 40 frames long). We think that, for a setting combining TPnet\({}_{053}\) and TPnet\({}_{451}\) (i.e. TPnet\({}_{453}\)), choosing training samples uniformly over the whole video while handling the absence of ground truth in the loss function could give us the best of both settings. We show the result of TPnet\({}_{453}\) under the current training setting in the supplementary material. The idea of regularising based on past prediction is similar to the one used by Ma et al. [23].

4.4 Test Time Detection Speed

Singh et al. [31] showcase their method’s online and real-time capabilities. Here we use their online tube generation method in our tube prediction framework to inherit those properties. The only open question is TPnet’s forward-pass speed. We thus measured the average time taken by a forward pass with a batch size of 1, as compared to 8 in [31]. A single forward pass takes 46.8 ms to process one test example, showing that TPnet can run in almost real time at 21 fps with two streams on a single 1080Ti GPU. Speed can be improved further by testing TPnet with \(\varDelta \) equal to 2 or 4, yielding a speed-up of \(2\times \) or \(4\times \), respectively. We currently use the dense optical flow of [4], which is slow; however, as in [31], we can always switch to real-time optical flow [18] with a small drop in performance.

5 Conclusions

We presented TPnet, a deep learning framework for future action tube prediction in videos which, unlike previous online tube detection methods [31, 32], predicts the future of action tubes as early as when only \(10\%\) of the video has been observed. It copes with future uncertainty better than the baseline methods while remaining state of the art on the action detection task. Hence, we provide a scalable platform to push the boundaries of action tube prediction research; the approach is implicitly scalable to multiple action tube instances in a video, as a future prediction is made for each action tube separately. In the future, we plan to extend TPnet to action prediction in temporally untrimmed videos.