
1 Introduction

The amount of online video data is staggering: hundreds of hours of videos are uploaded to YouTube every minute [2], video posts on Facebook have been increasing by more than \(90\%\) annually, and by Cisco’s estimate, traffic from online videos will constitute over \(80\%\) of all consumer Internet traffic by 2020.

As such, there has been growing interest in automatic video summarization. The main objective is to shorten a video while preserving the important and relevant information it contains. A shortened video is more convenient and efficient both for interactive use (such as exploratory browsing) and for fast indexing and matching (such as responding to search queries). To this end, a common type of summary is composed of a selected set of frames (i.e. keyframes) [15, 19, 30, 32, 40, 50, 63] or of segments or subshots (i.e. keyshots) [29, 34, 41, 43]. Other formats are possible [12, 14, 25, 46, 51], though they are not the focus of this work.

Many methods for summarization have been proposed and studied. Among them, supervised learning-based techniques have recently gained significant attention [6, 15, 18, 19, 49, 54, 64, 65, 66]. As opposed to unsupervised ones [21, 25, 26, 30, 33, 34, 36, 38, 43, 63], supervised techniques explicitly maximize the correspondence between the automatically generated summary and the human-created one. As such, those techniques often achieve higher performance.

Fig. 1. Conceptual diagram of our approach. Our model consists of two components. First, we use seq2seq models to generate summaries. Second, we introduce a retrospective encoder that maps the generated summaries to an abstract semantic space so that we can measure how well the summaries preserve the information in the original videos, i.e. a summary should be close to its original video while distant from other videos and summaries. In this semantic space, we derive metric-learning-based loss functions and combine them with the discriminative loss of matching human-generated summaries and predicted ones. See text for details.

In particular, recent work on applying sequence-to-sequence learning techniques to video summarization has introduced several promising models [24, 38, 64, 65, 66]. Viewing summarization as a structured prediction problem, those techniques model long-range dependencies in video using the popular long short-term memory (LSTM) units and their variants [7, 17, 20]. The key idea is to maximize the accuracy with which frames or subshots are selected for the summary. Figure 2 shows the basic concepts behind these modeling techniques, to which we collectively refer as seq2seq.

The conventional overlap accuracy is a useful surrogate for measuring how good the generated summaries are. However, it suffers from several flaws. First, it weighs equally the local correspondence between human and machine summaries on all frames or subshots. For instance, for a video of a soccer game, while the moment around a goal shot is arguably in every human annotator’s summary, whether other less critical events (before or after the shot) are included can vary considerably: for example, subshots showing running across different sections of the field are equally good (or bad). Thus modeling those subshots in the summary is not always useful or necessary. Instead, we ought to assess whether the summary “holistically” preserves the most important and relevant information in the original video.

The second difficulty of employing overlap accuracy (and thus, to a large degree, supervised learning techniques) is the demand for time-consuming and labor-intensive annotation, which has limited the size of existing datasets, cf. [65]. Thus supervised techniques have limited applicability when annotated data is scarce.

To address these flaws, we propose a new sequence learning model—retrospective sequence-to-sequence learning (re-seq2seq). The key idea behind re-seq2seq is to measure how well the machine-generated summary is similar to the original video in an abstract semantic space.

Specifically, as the original video is processed by the encoder component of a seq2seq model, the encoder outputs a vector embedding which represents the semantic meaning of the original video. We then pass the outputs of the decoder, which should yield the desired summary, to a retrospective encoder to infer a vector embedding to represent the semantic meaning of the summary. If the summary preserves the important and relevant information in the original video, then we should expect that the two embeddings are similar (e.g. in Euclidean distance). Figure 1 schematically illustrates the idea. Besides learning to “pull” the summary close to the original video, our model also “pushes far away” the embeddings that do not form corresponding pairs.

The measure of similarity (or distance) is combined with the standard loss function (in seq2seq models) that measures how well the summary aligns locally on the frame/shot-level with what is provided by human annotators. However, the proposed learning of similarity in the abstract semantic space provides additional benefits. Since it does not use any human annotations, our measure can be computed on videos without any “ground-truth” summaries. This provides a natural basis for semi-supervised learning where we can leverage the large amount of unlabeled videos to augment training.

To summarize, our contributions are: (i) a novel sequence learning model for video summarization, which combines the benefits of discriminative learning, by aligning human annotations with the model’s output, and of semi-supervised/unsupervised learning, by embedding the model’s output and the original video in close proximity; (ii) an extensive empirical study demonstrating the effectiveness of the proposed approach on several benchmark datasets, and highlighting the advantages of using unlabeled data to improve summarization performance.

2 Related Work

Unsupervised video summarization methods mostly rely on manually designed criteria [8, 11, 21, 25, 27, 30, 31, 33, 34, 36, 43, 45, 50, 63, 67], e.g. importance, representativeness, and diversity. In addition, auxiliary cues such as web images [21, 26, 27, 50] or video categories [44, 45] are also exploited in the unsupervised (weakly-supervised) summarization process.

Supervised learning for video summarization has made significant progress [6, 15, 18, 19, 64]. Framing the task as a special case of structured prediction, Zhang et al. [65] proposed to use sequence learning methods, and in particular, sequence-to-sequence models [7, 10, 52] that have been very successful in other structured prediction problems such as machine translation [22, 23, 35, 47, 58], image or video caption generation [55, 57, 59], parsing [56], and speech recognition [4].

Several extensions of sequence learning models have since been studied [24, 38, 66, 68]. Yang et al. [60] and Mahasseni et al. [38] share a modeling intuition somewhat similar to ours. In their works, the model is designed such that the video highlight/summary (as a sequence) can generate another (video) sequence similar to the original video. However, this desideratum is very challenging to achieve. In particular, the mapping from video to summary is lossy, making the reverse mapping almost unattainable: for instance, objects that appear only in the discarded frames cannot be reliably recovered from the summary alone. In contrast, our model has a simpler architecture and contains fewer LSTM units (thus fewer parameters); our approach only requires that the embeddings of human-created and predicted summaries be close. It attains better results than those reported in [38].

Zhou et al. [68] propose to use reinforcement learning to model the sequential decision-making process of selecting frames as a summary. While interesting, designing reward functions with heuristic criteria can be as challenging as devising unsupervised summarization criteria, and the method shows only minor gains over the less competitive, fully supervised model of [65]. Both Zhao et al. [66] and Ji et al. [24] introduce hierarchical LSTMs and attention mechanisms for modeling videos. They focus on maximizing the alignment between the generated summaries and those of human annotators. Our approach also uses hierarchical LSTMs but further incorporates objectives to match the generated summaries and the original videos, i.e. aiming to preserve important information in the original videos. Our experimental study demonstrates the advantages.

3 Approach

We start by stating the setting for the video summarization task, introducing the notations and (briefly) the background on sequence learning with LSTMs [20, 39, 52, 61]. We then describe the proposed retrospective encoder sequence-to-sequence (re-seq2seq) approach in detail. The model extends the standard encoder-decoder LSTM by applying an additional encoder on the outputs of the decoder and introducing new loss functions.

3.1 Setting, Notations, and Sequence Learning Models

We represent a video as a sequence \(\varvec{X}= \{{\varvec{x}}_1, {\varvec{x}}_2, \cdots , {\varvec{x}}_{\mathsf {T}}\}\) where \({\varvec{x}}_t\), \(t \in \{1,\cdots ,\mathsf {T}\}\), is the feature vector characterizing the t-th frame in the video. We denote the (sub)shots in the video by \(\varvec{B}= \{\varvec{b}_1, \varvec{b}_2, \cdots , \varvec{b}_{\mathsf {B}}\}\). Each of the \(\mathsf {B}\) shots refers to a consecutive subset of \(\varvec{X}\) and does not overlap with the others.

Video Summarization Task. The task is to select a subset of shots as the summary, denoted as \(\varvec{Y}= \{\varvec{y}_1, \varvec{y}_2, \cdots , \varvec{y}_{\mathsf {L}}\}\) where \(\varvec{y}_l\) indicates the feature vector for the l-th shot in the summary. Obviously we desire \(\mathsf {L}< \mathsf {B}< \mathsf {T}\). The ground-truth keyshots are denoted as \(\varvec{Z}= \{{\varvec{z}}_1, {\varvec{z}}_2, \cdots , {\varvec{z}}_\mathsf {L}\}\).

When \(\varvec{B}\) is not given (which is common in most datasets for the task), we use a shot-boundary detection model to infer the boundaries of the shots. This leads to misaligned shot boundaries between what is given in the ground-truth keyshots and what is inferred. We discuss how we handle this in the experimental details and in the Suppl. For clarity, we assume \(\varvec{B}\) is known and given throughout this section.

Sequence Learning. Long short-term memory networks (LSTMs) are a special kind of recurrent neural network that is adept at modeling long-range dependencies. They have been used to model temporal (and sequential) data very successfully [13, 20, 61]. An LSTM has time-varying memory state variables \(\varvec{c}_t\) and an output variable \(\varvec{h}_t\). The values of those variables depend on the current input, past outputs, and memory states. The details of the basic LSTM cells that we use in this paper are documented in the Suppl. and [16, 65].

Fig. 2. Our proposed approach for video summarization. The model has several distinctive features. First, it uses hierarchical LSTMs: a bottom layer of LSTMs models shots composed of frames, and an upper-layer LSTM models the video composed of shots. Second, the model has a retrospective encoder that computes the embeddings of the outputs of the decoder. The training objective for the retrospective encoder is to ensure that the embedding of the summary outputs matches the embedding computed from the original inputs, cf. Eq. (5).

seq2seq typically consists of a pair of LSTMs—the encoder and the decoder—which are combined to perform sequence transduction [3, 55]. Specifically, the encoder reads the input sequence \( \{{\varvec{x}}_1, {\varvec{x}}_2, \cdots , {\varvec{x}}_{\mathsf {T}}\}\) sequentially (or in any predefined order), and calculates a sequence of hidden states \(\varvec{H}= \{\varvec{h}_1, \varvec{h}_2, \cdots , \varvec{h}_{\mathsf {T}}\}\). Each hidden state \(\varvec{h}_t\) at step t feeds into the next step at \((t+1)\). The last hidden state \(\varvec{h}_{\mathsf {T}}\) feeds into the decoder. The decoder is similar to the encoder except for two changes: (1) the decoder has its own hidden state sequence \(\varvec{V}= \{\varvec{v}_1, \varvec{v}_2, \cdots , \varvec{v}_{\mathsf {L}}\}\); (2) the decoder does not have external input. Instead, its input (i.e. its version of \({\varvec{x}}_t\)) at step \((l+1)\) is now its own output at step l, given by \( \varvec{y}_l = g(\varvec{y}_{l-1}, \varvec{v}_l)\). Note that the function \(g(\cdot )\) as well as other parameters in the encoder/decoder are learnt from data. Figure 2 illustrates these steps, though in the context of hierarchical LSTMs.
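To make this concrete, the snippet below is a minimal PyTorch sketch of such a plain (non-hierarchical) encoder-decoder; the feature dimension, hidden size, summary length, and the linear read-out standing in for \(g(\cdot )\) are illustrative choices of ours, not the exact configuration used in the paper.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal encoder-decoder over frame features (illustrative sketch)."""
    def __init__(self, feat_dim=1024, hidden_dim=256, out_dim=1024, summary_len=5):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.decoder_cell = nn.LSTMCell(out_dim, hidden_dim)
        self.readout = nn.Linear(hidden_dim, out_dim)   # plays the role of g(.)
        self.summary_len = summary_len
        self.out_dim = out_dim

    def forward(self, x):
        # x: (batch, T, feat_dim) -- the input video frames
        _, (h_T, c_T) = self.encoder(x)                 # last hidden state feeds the decoder
        h, c = h_T.squeeze(0), c_T.squeeze(0)
        y_prev = x.new_zeros(x.size(0), self.out_dim)   # decoder has no external input
        outputs = []
        for _ in range(self.summary_len):
            h, c = self.decoder_cell(y_prev, (h, c))    # v_l from v_{l-1} and y_{l-1}
            y_prev = self.readout(h)                    # y_l as a function of v_l
            outputs.append(y_prev)
        return torch.stack(outputs, dim=1)              # (batch, L, out_dim)

video = torch.randn(2, 120, 1024)      # two toy videos of 120 frames each
summary = Seq2Seq()(video)
print(summary.shape)                   # torch.Size([2, 5, 1024])
```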

3.2 Retrospective-Encoder Sequence-to-sequence (re-seq2seq)

We propose several important extensions to the original seq2seq model: hierarchical modeling of frames and shots, and new loss functions for training.

Hierarchical Modeling. As shown in [42, 55, 66], despite LSTMs’ ability to model long-range dependencies, the ideal input length for LSTM modeling is less than 100 frames. Thus, it is challenging to model long videos directly. To this end, we leverage the hierarchical structure in a video [66] to capture dependencies across longer time spans.

As shown in Fig. 2, there are two encoder layers made of LSTM units for modeling frames and shots, respectively. The first layer is responsible for modeling at the frame level and yielding a representation for all the frames in the current shot. This representation is then fed as input to the second LSTM layer. The final output of the second layer is then treated as the embedding for the whole video as it combines all the information from all the shots.

Concretely, for the first-layer LSTMs, the input is \({\varvec{x}}_t\), the feature vector for the t-th frame. Assuming frame t lies within shot \(\varvec{b}_b\), the hidden state of this layer’s LSTM unit is \(\varvec{h}^{(1)}_t\); it encodes all frames from the beginning of the shot and is computed from the current feature \({\varvec{x}}_t\) and the previous hidden state \(\varvec{h}^{(1)}_{t-1}\).

When t passes the shot’s ending boundary, we take the final hidden state of the LSTM unit as \(\varvec{s}_b\), the encoding vector for the current shot \(\varvec{b}_b\). The LSTM unit’s memory \(\varvec{c}^{(1)}_t\) and hidden state \(\varvec{h}^{(1)}_t\) are then reset to \(\varvec{c}^{(1)}_0\) and \(\varvec{h}^{(1)}_0\), which are both zero vectors in our model (one could also learn them).

After all the frames are processed, we have a sequence of encodings \(\varvec{S}= \{\varvec{s}_b\}\) for \(b=1, 2, \cdots , \mathsf {B}\). We construct another LSTM layer over \(\varvec{S}\). The hidden states of this layer are denoted by \(\varvec{h}^{(2)}_b\). Of particular importance is \(\varvec{h}^{(2)}_\mathsf {B}\) which is regarded as the encoding vector for the whole video. (One can also introduce more layers for finer-grained modeling [9], and we leave that for future work.)

The decoder layer is similar to the one in a standard non-hierarchical LSTM (cf. Fig. 2). It does not have any input (except that \(\varvec{h}^{(2)}_\mathsf {B}\) initializes the first LSTM unit in the layer), and its output is denoted by \(\varvec{y}_l\) for \(l=1, 2, \cdots , \mathsf {L}\). Its hidden states are denoted by \(\varvec{v}_l\); each is a function of the previous hidden state \(\varvec{v}_{l-1}\) and output \(\varvec{y}_{l-1}\). The output \(\varvec{y}_l\) is parameterized as a function of \(\varvec{v}_l\).
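The sketch below illustrates, under our own simplifying assumptions, the hierarchical encoder and the input-less decoder described above: the frame-level LSTM is restarted from zero states at every shot boundary to produce the shot encodings \(\varvec{s}_b\), the shot-level LSTM produces the video embedding \(\varvec{h}^{(2)}_\mathsf {B}\), and the decoder emits \(\mathsf {L}\) output vectors. The dimensions, the output size (chosen to match the concatenated target of Eq. (2) below), and the shot boundaries are illustrative.

```python
import torch
import torch.nn as nn

class HierarchicalEncoderDecoder(nn.Module):
    def __init__(self, feat_dim=1024, hidden_dim=256, out_dim=1280, summary_len=4):
        super().__init__()
        self.frame_lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)   # layer 1
        self.shot_lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)  # layer 2
        self.decoder_cell = nn.LSTMCell(out_dim, hidden_dim)
        self.readout = nn.Linear(hidden_dim, out_dim)
        self.summary_len, self.out_dim = summary_len, out_dim

    def encode(self, frames, shot_bounds):
        # frames: (T, feat_dim); shot_bounds: list of (start, end) index pairs
        shot_encodings = []
        for start, end in shot_bounds:
            # states are implicitly reset to zeros for every shot (h_0 = c_0 = 0)
            _, (h, _) = self.frame_lstm(frames[start:end].unsqueeze(0))
            shot_encodings.append(h.squeeze(0))            # s_b: (1, hidden_dim)
        S = torch.cat(shot_encodings, dim=0).unsqueeze(0)  # (1, B, hidden_dim)
        _, (h2, c2) = self.shot_lstm(S)                    # h2: video embedding h^(2)_B
        return S.squeeze(0), h2.squeeze(0), c2.squeeze(0)

    def decode(self, h2, c2):
        y_prev = h2.new_zeros(1, self.out_dim)             # no external input
        h, c = h2, c2
        outputs = []
        for _ in range(self.summary_len):
            h, c = self.decoder_cell(y_prev, (h, c))
            y_prev = self.readout(h)
            outputs.append(y_prev)
        return torch.cat(outputs, dim=0)                   # (L, out_dim)

model = HierarchicalEncoderDecoder()
frames = torch.randn(90, 1024)                 # one toy video of 90 frames
bounds = [(0, 30), (30, 55), (55, 90)]         # three hypothetical shots
S, h2, c2 = model.encode(frames, bounds)
Y = model.decode(h2, c2)
print(S.shape, h2.shape, Y.shape)              # (3, 256), (1, 256), (4, 1280)
```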

Regression Loss Function for Matching Summaries. In supervised learning for video summarization, we maximize the accuracy of predicting whether a particular frame/subshot is selected. To this end, one would set the output of the decoder as a binary variable and use a cross-entropy loss function, cf. [65] for details.

In our model, however, we intend to treat the outputs as a shortened video sequence and regard \(\varvec{y}_l\) as visual feature vectors (of shots). Thus, the cross-entropy loss function is not applicable. Instead, we propose the following regression loss

$$\begin{aligned} \ell _{\textsc {summary}} = \sum _{l=1}^{\mathsf {L}}\Vert \varvec{y}_l - \varvec{g}_l\Vert _2^2, \end{aligned}$$
(1)

where \(\varvec{g}_l\) is a target vector corresponding to the l-th shot in the ground-truth summary \(\varvec{Z}\). Suppose this shot corresponds to the \(b_l\)-th shot in the sequence of shots \(\varvec{B}\), where \(b_l\) is between 1 and \(\mathsf {B}\). We then compose \(\varvec{g}_l\) as the concatenation of two vectors: (1) the encoding \(\varvec{s}_{b_l}\) of the \(b_l\)-th shot as computed by the first LSTM layer, and (2) the average of frame-level vectors \({\varvec{x}}_t\) within the \(b_l\)-th shot. We use \(\bar{{\varvec{x}}}_{b_l}\) to denote this average. \(\varvec{g}_l\) is thus given by

$$\begin{aligned} \varvec{g}_l = [ \varvec{s}_{b_l}\ \ \bar{{\varvec{x}}}_{b_l}]. \end{aligned}$$
(2)

This form of \(\varvec{g}_l\) is important. While the goal is to generate outputs \(\varvec{y}_l\) closely matching the encodings of shots, we could obtain trivial solutions where the LSTMs learn to encode shots as a constant vector and to output a constant vector as the summary. The inclusion of the video’s raw feature vectors in the learning effectively eliminates trivial solutions.
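Continuing in the same sketch style, the function below assembles the target vectors \(\varvec{g}_l\) of Eq. (2) and evaluates the regression loss of Eq. (1); the tensor shapes and the choice of which shots form the ground-truth summary are hypothetical.

```python
import torch

def summary_regression_loss(Y, S, frames, shot_bounds, gt_shot_ids):
    """Eq. (1)-(2): match decoder outputs y_l to targets g_l = [s_{b_l}; mean frame feature]."""
    targets = []
    for b in gt_shot_ids:                                # indices of ground-truth summary shots
        start, end = shot_bounds[b]
        mean_frames = frames[start:end].mean(dim=0)      # average frame feature within shot b
        targets.append(torch.cat([S[b], mean_frames]))   # g_l = [s_b ; x_bar_b]
    G = torch.stack(targets)                             # (L, hidden_dim + feat_dim)
    return ((Y - G) ** 2).sum()                          # sum of squared L2 distances

# toy example with made-up shapes (256-d shot encodings, 1024-d frame features)
frames = torch.randn(90, 1024)
bounds = [(0, 30), (30, 55), (55, 90)]
S = torch.randn(3, 256)              # shot encodings s_b from the first LSTM layer
Y = torch.randn(2, 1280)             # decoder outputs y_l
loss = summary_regression_loss(Y, S, frames, bounds, gt_shot_ids=[0, 2])
print(loss.item())
```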

Embedding Matching for the Summary and the Original. The intuition behind our modeling is that the outputs should convey the same amount of information as the inputs. For summarization, this is precisely the goal: a good summary should be such that after viewing the summary, users would get about the same amount of information as if they had viewed the original video.

How can we measure and characterize the amount of information conveyed in the original sequence and in the summary? Recall that the basic assumption about the encoder LSTMs is that they compress the semantic information of their inputs into a semantic embedding, namely \(\varvec{h}^{(2)}_{\mathsf {B}}\), the final hidden state. Likewise, if we add another encoder to the decoder output \(\varvec{Y}\), then this new encoder should also be able to compress it into a semantic embedding with its final hidden state \(\varvec{u}_{\mathsf {L}}\). Figure 2 illustrates the new re-seq2seq model structure.

To learn this “retrospective” encoder, we use the following loss

$$\begin{aligned} \ell _{\textsc {match}} = \Vert \varvec{h}^{(2)}_{\mathsf {B}} - \varvec{u}_{\mathsf {L}} \Vert _2^2. \end{aligned}$$
(3)

Contrastive Embedding for Mismatched Summary and Original. We can further model the alignment between the summary and the original by adding penalty terms that penalize mismatched pairs:

$$\begin{aligned} \ell _{\textsc {mismatch}}&= \sum _{\varvec{h}'} [m + \Vert \varvec{h}^{(2)}_{\mathsf {B}} - \varvec{u}_{\mathsf {L}} \Vert _2^2 - \Vert \varvec{h}' - \varvec{u}_{\mathsf {L}} \Vert _2^2]_+ \nonumber \\&+ \sum _{\varvec{u}'} [m + \Vert \varvec{h}^{(2)}_{\mathsf {B}} - \varvec{u}_{\mathsf {L}} \Vert _2^2 - \Vert \varvec{h}^{(2)}_{\mathsf {B}} - \varvec{u}' \Vert _2^2]_+, \end{aligned}$$
(4)

where \(\varvec{h}'\) (or \(\varvec{u}'\)) is the hidden state in the shot-level LSTM layer (or the re-encoder LSTM layer) from a video other than \(\varvec{h}^{(2)}_{\mathsf {B}}\) (or a summary other than \(\varvec{u}_{\mathsf {L}}\)). \(m >0\) is a margin parameter and \([\cdot ]_+ = \max (0, \cdot )\) is the standard hinge loss function. In essence this loss function aims to push apart mismatched summaries and videos.
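The two annotation-free losses can be sketched as follows; here the negative embeddings \(\varvec{h}'\) and \(\varvec{u}'\) are simply taken from other items in a mini-batch, which is one plausible way to realize the sums in Eq. (4) and may differ from the exact sampling scheme used in our experiments.

```python
import torch

def match_loss(h_video, u_summary):
    """Eq. (3): squared L2 distance between video and summary embeddings."""
    return ((h_video - u_summary) ** 2).sum()

def mismatch_loss(h_video, u_summary, other_h, other_u, margin=1.0):
    """Eq. (4): hinge loss pushing mismatched video/summary embeddings apart."""
    pos = ((h_video - u_summary) ** 2).sum()
    loss = h_video.new_zeros(())
    for h_neg in other_h:                  # embeddings h' of other videos
        loss = loss + torch.clamp(margin + pos - ((h_neg - u_summary) ** 2).sum(), min=0)
    for u_neg in other_u:                  # embeddings u' of other summaries
        loss = loss + torch.clamp(margin + pos - ((h_video - u_neg) ** 2).sum(), min=0)
    return loss

# toy 256-d embeddings: one matched pair plus two negatives of each kind
h_B = torch.randn(256)
u_L = torch.randn(256)
neg_h = [torch.randn(256) for _ in range(2)]
neg_u = [torch.randn(256) for _ in range(2)]
print(match_loss(h_B, u_L).item(), mismatch_loss(h_B, u_L, neg_h, neg_u).item())
```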

Final Training Objective. We train the models by balancing the different types of loss functions

$$\begin{aligned} \ell = \ell _{\textsc {summary}} + \lambda \ell _{\textsc {match}} + \eta \ell _{\textsc {mismatch}}, \end{aligned}$$
(5)

where \(\lambda \) and \(\eta \) are tradeoff parameters.

Note that neither \(\ell _{\textsc {match}}\) nor \(\ell _{\textsc {mismatch}}\) requires human-annotated summaries. Thus they can also be used to incorporate unannotated video data that are disjoint from the annotated data on which \(\ell _{\textsc {summary}}\) is computed. Our empirical results will show that learning with these two loss terms noticeably improves over learning with only the discriminative loss.
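The following toy snippet makes the semi-supervised use explicit: over a mixed mini-batch, Eq. (5) is applied in full to annotated videos, while unannotated videos contribute only the \(\ell _{\textsc {match}}\) and \(\ell _{\textsc {mismatch}}\) terms. The \(\lambda \) and \(\eta \) values are placeholders in the range we later tune on the validation set.

```python
import torch

def batch_objective(items, lam=0.1, eta=0.15):
    """Eq. (5) over a mixed mini-batch: annotated items use all three terms,
    unannotated ones contribute only the annotation-free match/mismatch terms."""
    total = torch.tensor(0.0)
    for it in items:
        total = total + lam * it["match"] + eta * it["mismatch"]
        if it.get("summary_loss") is not None:       # present only for labeled videos
            total = total + it["summary_loss"]
    return total

batch = [
    {"summary_loss": torch.tensor(2.0), "match": torch.tensor(0.5), "mismatch": torch.tensor(0.3)},
    {"summary_loss": None, "match": torch.tensor(0.7), "mismatch": torch.tensor(0.4)},  # unlabeled
]
print(batch_objective(batch).item())
```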

The specific forms of \(\ell _{\textsc {match}}\) and \(\ell _{\textsc {mismatch}}\) are also reminiscent of metric learning (for multi-way classification) [5]. In particular, we transform both the summary and the video into vectors (through a series of encoder/decoder LSTMs) and perform metric learning in that abstract semantic space. However, different from traditional metric learning methods, we also need to learn to infer the desired representations of the structured objects (i.e. sequences of frames).

Implementation Details. Throughout our empirical studies, we use standard LSTM units. The dimension of the hidden units for all LSTMs is 256. All LSTM parameters are randomly initialized from a uniform distribution over \([-0.05, 0.05]\). The models are trained to convergence with Adam [28], with an initial learning rate of \(4e-4\) and a mini-batch size of 10. All experiments are run on a single GPU. Please refer to the Suppl. for more details.

4 Experiments

We first introduce the experimental setting (datasets, features, metrics) in Sect. 4.1. We then present the main quantitative results in the supervised learning setting to demonstrate the advantages of the proposed approach over existing methods in Sect. 4.2. We further validate the proposed approach in the semi-supervised setting in Sect. 4.3. We perform ablation studies and analyze strengths and weaknesses of our approach in Sects. 4.4 and 4.5.

4.1 Setup

Datasets. We evaluate on 3 datasets. The first two have been widely used to benchmark summarization tasks [24, 38, 65, 68]: SumMe [19] and TVSum [50]. SumMe consists of 25 user videos of a variety of events such as holidays and sports. TVSum contains 50 videos downloaded from YouTube in 10 categories. Both datasets provide multiple user-annotated summaries per video.

Following [65], we also use Youtube [11] and the Open Video Project (OVP) [1, 11] as auxiliary datasets to augment the training data. We use the same set of features, i.e. each frame \({\varvec{x}}_i\) is represented by the output of the penultimate layer (pool 5) of GoogLeNet [53] (1024 dimensions). As pointed out in [65], the limited amount of annotated data restricts the applicability of supervised learning techniques. Thus, we focus on the augmented setting (unless specified otherwise), as in that work and other follow-up ones. In this setting, \(20\%\) of the dataset is used for evaluation, \(20\%\) for validation (in our experiments), and the remaining \(60\%\) is combined with the auxiliary datasets for training. We tune hyperparameters, e.g. \(\lambda \) and \(\eta \), w.r.t. the performance on the validation set.

We also demonstrate our model on a third dataset, VTW [62], which is large-scale and was originally proposed for video highlight detection. In [66], it is re-targeted for video summarization by converting the highlights into keyshots. VTW consists of user-generated videos that are mostly shorter than those in SumMe and TVSum. The dataset is split into 1500 videos for training and 500 videos for testing, as in [66]. We have not been able to confirm the split details, so our results in this paper are not directly comparable to their reported ones.

Shot Boundary Generation. As mentioned in Sect. 3, shot boundaries for each video are required for both training and testing. However, none of the datasets used in our experiments is annotated with ground-truth shot boundaries: VTW is annotated only for keyshots, SumMe is annotated by multiple users for keyshots, and TVSum is annotated over fixed-length intervals (2 s). To this end, we train a single-layer LSTM for shot-boundary detection on another, disjoint dataset, CoSum [8], which contains 51 videos with human-annotated shot boundaries. The LSTM has 256-dim hidden units and is trained for 150 epochs with learning rate \(4e-4\). We then threshold its predictions on new videos to detect the boundaries; the best thresholds are determined by the summarization performance on the validation set. Please refer to the Suppl. for details on shot boundary detection.
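As a rough illustration of this boundary-detection step, the sketch below scores each frame with a single-layer LSTM and thresholds the scores; the sigmoid scoring head and the threshold value are our own assumptions for the sketch, and in practice the threshold is tuned on the validation set as described above.

```python
import torch
import torch.nn as nn

class BoundaryDetector(nn.Module):
    """Single-layer LSTM that emits a per-frame boundary probability (illustrative)."""
    def __init__(self, feat_dim=1024, hidden_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, frames):                            # frames: (1, T, feat_dim)
        h, _ = self.lstm(frames)
        return torch.sigmoid(self.score(h)).squeeze(-1)   # (1, T) scores in [0, 1]

detector = BoundaryDetector()
frames = torch.randn(1, 200, 1024)
probs = detector(frames)
threshold = 0.5                                           # tuned on the validation set in practice
boundaries = (probs[0] > threshold).nonzero(as_tuple=True)[0]
print(boundaries[:10])
```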

Evaluation. Given the heterogeneous video types and summary formats in these datasets, we follow the procedures outlined in [65, 66] to prepare training, validation, and evaluation data. In particular, we set the threshold on the total duration of keyshots to \(15\%\) of the original video length (for all datasets), following the protocols in [19, 50, 66]. Then, for evaluation, we compare the generated summary A to the user summary B by computing precision (P) and recall (R) according to the temporal overlap between the two, as well as their harmonic mean, the F-score [18, 19, 38, 50, 65, 66]. Higher scores are better. Please refer to the Suppl. for details on the evaluation with multiple user summaries.
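For concreteness, the snippet below computes precision, recall, and F-score from per-frame binary indicators of a generated and a user summary, following the temporal-overlap protocol described above; the 15% duration budget and the averaging over multiple user summaries are handled as in the cited protocols and are omitted here.

```python
import numpy as np

def keyshot_f_score(pred, user):
    """Temporal-overlap P/R/F between two binary per-frame summary indicators."""
    pred, user = np.asarray(pred, bool), np.asarray(user, bool)
    overlap = np.logical_and(pred, user).sum()
    precision = overlap / max(pred.sum(), 1)
    recall = overlap / max(user.sum(), 1)
    f = 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f

# toy 20-frame video: frames 5-9 predicted, frames 7-12 annotated by the user
pred = [0] * 20; pred[5:10] = [1] * 5
user = [0] * 20; user[7:13] = [1] * 6
print(keyshot_f_score(pred, user))   # (0.6, 0.5, 0.545...)
```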

Table 1. Performance (F-score) of various supervised video summarization methods on three datasets. Published results are denoted in italic; our implementations are in normal font. Nonzero values of \(\lambda \) and \(\eta \) represent contributions from relevant terms in our models.

4.2 Supervised Learning Results

In Table 1, we compare our approach to several state-of-the-art supervised methods for video summarization. We report published results as well as results from our implementation of [66]. Only the best variants of all methods are quoted. We have implemented a strong baseline, re-seq2seq (\(\lambda =0, \eta =0\)), trained with the objective function Eq. (5) described in Sect. 3. This baseline differs from the LSTM-based seq2seq models in [65], which are frame-based and to which we refer as seq2seq (frame only); re-seq2seq (\(\lambda =0, \eta =0\)) has the advantage of hierarchical modeling. The optimal \(\lambda ^*\) and \(\eta ^*\) are tuned on the validation set: \(\lambda ^*=0.1, 0.1, 0.2\) and \(\eta ^*=0.15, 0.1, 0.2\) for SumMe, TVSum, and VTW, respectively. The cells with red-colored numbers indicate the best-performing method in each column.

Main Results. Our approach re-seq2seq (\(\lambda =\lambda ^*, \eta =\eta ^*\)) performs the best on all 3 datasets. Hierarchical modeling is clearly advantageous, evidenced by the performance of all our model variants and of [66]. Our model re-seq2seq (\(\lambda =0, \eta =0\)) is slightly worse than [66], most likely because we use a regression summary loss while they use cross-entropy; note that the regression loss is needed in order to incorporate the matching and mismatching losses in our model, cf. Eq. (5). The advantage of incorporating the retrospective encoder loss is clearly evidenced.

Table 2. F-scores on the TVSum dataset in the semi-supervised learning setting. n indicates the number of unannotated videos used for training.
Table 3. Performance of our model with different types of shot boundaries.

4.3 Semi-Supervised Learning Results

Next we carry out experiments to show that the proposed approach can benefit from both unlabeled and labeled video data. For labeled data, we use the same train and test set as in the augmented setting, i.e. OVP + Youtube + SumMe + \(80\%\) TVSum, and \(20\%\) TVSum, respectively. For unlabeled data, we randomly sample n videos from the VTW dataset and ignore their annotations. We investigate two possible means of semi-supervised training:

(1) Pre-training: the unlabeled data are used to pre-train re-seq2seq to minimize \(\ell _{\textsc {match}}\) and \(\ell _{\textsc {mismatch}}\) only. The pre-trained model is then fine-tuned with the labeled training data to minimize Eq. (5).

(2) Joint-training: we jointly train the model with labeled and unlabeled data: we minimize Eq. (5) for labeled data, and \(\ell _{\textsc {match}}\) and \(\ell _{\textsc {mismatch}}\) for unlabeled data.

Note that the test set is used only for testing and not as either labeled or unlabeled data during training, which differs from the transductive setting [38]. Results are shown in Table 2. In general, both pre-training and joint-training improve over supervised learning, with joint-training slightly better when more unlabeled data are available. The results are also encouraging in that adding more unlabeled data yields larger improvements.

Table 4. Performances of transductive setting on SumMe and TVSum.
Table 5. Performance of the proposed approach with different choices of \(\lambda \) and \(\eta \).

4.4 Ablation Study

Shot-Boundaries. Shot boundaries play an important role in keyshot-based summarization methods. In this paper we learn an LSTM to infer the shot boundaries, while in [65] an unsupervised shot boundary detection method, KTS [45], is applied. Table 3 reports the performance of our model with shot boundaries generated by KTS and by the learned LSTM, respectively. The main observation is that better shot boundary detection generally improves summarization.

Transductive Setting. To make a fair comparison to [38], we next evaluate our model in the transductive setting, where the testing data are included when computing the two new loss terms \(\ell _{\textsc {match}}\) and \(\ell _{\textsc {mismatch}}\). The results, shown in Table 4, are clearly stronger than those in the supervised setting (Table 1). One possible interpretation is that our model maps a video and its summary into close proximity, while the reconstruction from a summary back to the original video in [38] may be lossy or even unattainable.

Contributions of Each Loss Term. Table 5 reports experiments with different combinations of \(\ell _{\textsc {match}}\) and \(\ell _{\textsc {mismatch}}\) through their balancing parameters \(\lambda \) and \(\eta \). To summarize, jointly minimizing both loss terms yields state-of-the-art performance on all datasets. Furthermore, the effect of the different combinations is consistent across the 3 datasets: using \(\ell _{\textsc {mismatch}}\) alone achieves the same or slightly better performance than using \(\ell _{\textsc {match}}\) alone, while combining both always obtains the best performance. Please refer to the Suppl. for model details.

Other Detailed Analysis in Suppl. We summarize additional discussions as follows. We show that summaries by our approach attain diversity comparable to those by dppLSTM [65]. We also show that our approach outperforms the autoencoder-based method [60]. We further analyze the connection between our approach and recent works [48, 49] on query-focused summarization.

Fig. 3. Exemplar videos and predicted summaries by re-seq2seq (\(\lambda =\lambda ^*, \eta =\eta ^*\)) and re-seq2seq (\(\lambda =0, \eta =0\)). Pictures at the top are sampled from the video and those at the bottom are sampled from the corresponding summary. The ground-truth importance scores are shown as a gray background. See text for details.

4.5 Qualitative Results and Analysis

How Does re-seq2seq Summarize Differently than Regular seq2seq? We examine a few exemplar video summarization results in Fig. 3 to shed light on how the learning objective of re-seq2seq affects summarization results. Our approach aims to reduce the difference between the semantic embeddings of input videos and output summaries, cf. Eq. (5). Balancing this with the need to match the outputs to human summaries, it is sensible for our approach to summarize broadly. This results in comprehensive coverage of visual features and thus increases the chance of similar embeddings. The opposite strategy of selecting from a concentrated area is unlikely to yield high similarity, as the selected frames are unlikely to provide sufficient coverage of the original.

Figure 3 precisely highlights the strategy adopted by our approach. The video of Fig. 3(a) is about a bike parade. re-seq2seq (\(\lambda =0, \eta =0\)) summarizes the mid-section of the video but completely misses the important part at the beginning, which shows that the parade actually starts in the suburbs and proceeds over a bridge to downtown. In contrast, re-seq2seq (\(\lambda =\lambda ^*, \eta =\eta ^*\)) selects broadly from the video's shots, showing much better consensus with the video. In Fig. 3(b), however, re-seq2seq (\(\lambda =\lambda ^*, \eta =\eta ^*\)) slightly underperforms re-seq2seq (\(\lambda =0, \eta =0\)). The video depicts a flash mob in Copenhagen. re-seq2seq (\(\lambda =0, \eta =0\)) obtains a better F-score by focusing on the mid-section of the video, where most of the human activity occurs, and correctly captures the major events in that region. re-seq2seq (\(\lambda =\lambda ^*, \eta =\eta ^*\)), on the other hand, spreads out its selection and captures only a small part of the major event.

While more error analysis is desirable, this preliminary evidence suggests that re-seq2seq (\(\lambda =\lambda ^*, \eta =\eta ^*\)) works well for videos that depict various scenes and activities following a storyline. It might not work as well on “bursty” videos, where short but interesting shots are scattered among a large number of frames with non-essential information that is likely to be discarded when summarized.

Fig. 4. t-SNE visualization of semantic encodings of videos (denoted as \(\bullet \)) and their summaries (denoted as \(\varvec{\times }\)). Corresponding pairs are shown in the same color; the closer they are, the better. Each dashed ellipsoid indicates that a video is the nearest neighbor of its summary after embedding. See text for details.

Can re-seq2seq Lead to Semantically Similar Embeddings Between the Video and Summary? Here we evaluate how well a video and its summary can be embedded in close proximity. Videos used here are sampled from the TVSum dataset. For re-seq2seq (\(\lambda =0, \eta =0\)), we feed the summary into the same encoder used for videos and take the encoder's output as its embedding; for re-seq2seq (\(\lambda =\lambda ^*, \eta =\eta ^*\)), we take the summary embedding from the retrospective LSTM encoder. We then use t-SNE [37] to visualize the embeddings in 2D space, as shown in Fig. 4. We use circles to denote video embeddings and crosses for summary embeddings; each video-summary pair is marked by the same color.
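A minimal sketch of this visualization step, assuming the 256-d video and summary embeddings have already been collected; we use scikit-learn's t-SNE with illustrative settings and synthetic embeddings in place of the real ones.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n_videos = 10
video_emb = rng.normal(size=(n_videos, 256))        # stands in for h^(2)_B of each video
summary_emb = video_emb + 0.1 * rng.normal(size=(n_videos, 256))  # stands in for u_L

# embed videos and summaries jointly into 2D
emb_2d = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(
    np.vstack([video_emb, summary_emb]))
videos_2d, summaries_2d = emb_2d[:n_videos], emb_2d[n_videos:]

colors = plt.cm.tab10(np.arange(n_videos))          # one color per video-summary pair
plt.scatter(videos_2d[:, 0], videos_2d[:, 1], c=colors, marker="o", label="video")
plt.scatter(summaries_2d[:, 0], summaries_2d[:, 1], c=colors, marker="x", label="summary")
plt.legend()
plt.savefig("tsne_embeddings.png")
```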

We can clearly observe that a video and its summary are mostly embedded much closer by re-seq2seq (\(\lambda =\lambda ^*, \eta =\eta ^*\)) (Fig. 4(b)) than by re-seq2seq (\(\lambda =0, \eta =0\)) (Fig. 4(a)). In particular, a video embedded by re-seq2seq (\(\lambda =\lambda ^*, \eta =\eta ^*\)) mostly has its corresponding summary as the nearest neighbor, which is usually not the case for re-seq2seq (\(\lambda =0, \eta =0\)). Moreover, the embeddings of videos and summaries in Fig. 4(a) are ‘clustered’ together compared to those in Fig. 4(b), where different video-summary pairs are relatively far away from each other. This shows that the proposed approach embeds a summary and its original video into similar locations, while pushing apart mismatched summaries and videos.

5 Conclusion

We propose a novel sequence-to-sequence learning model for video summarization that not only minimizes the discriminative loss for matching the generated and target summaries, but also embeds corresponding video and summary pairs in close proximity in an abstract semantic space. The proposed approach exploits both labeled and unlabeled videos to derive the semantic embeddings. Extensive experimental results on multiple datasets show the advantage of our approach over existing methods in both supervised and semi-supervised settings. In the future, we plan to explore more refined strategies for incorporating unlabeled data during training to further improve summarization performance.