1 Introduction

Recently, there has been growing interest in extracting representative visual elements from a video for sharing on social media, with the aim of effectively expressing the semantics of the original lengthy video. However, this task, often referred to as video summarization, is laborious, subjective and challenging, since videos usually exhibit very complex semantic structures, including diverse scenes, objects, actions and their complex interactions.

A noticeable trend in recent years is the use of deep neural networks (DNNs) [10, 44] for video summarization, since DNNs have made significant progress in various video understanding tasks [2, 12, 19]. However, the annotations used in video summarization take the form of frame-wise labels or importance scores, so collecting a large number of annotated videos demands tremendous effort and cost. Consequently, the widely-used benchmark datasets [1, 31] only cover dozens of well-annotated videos, which becomes a prominent stumbling block that hinders further improvement of DNN-based summarization techniques. Meanwhile, annotations for the summarization task are subjective and inconsistent across annotators, potentially leading to overfitting and biased models. Therefore, recent studies have turned toward augmented data sources such as web images [13], GIFs [10] and texts [23], which are complementary for the summarization purpose.

To push the techniques further in this direction, we consider an efficient weakly-supervised setting that learns summarization models from a vast number of web videos. Compared with other types of auxiliary source-domain data for video summarization, the temporal dynamics in these user-edited “templates” offer rich information for locating diverse but semantically consistent visual contents, which can be used to alleviate the ambiguities arising from small summarization datasets. These short-form videos are readily available from web repositories (e.g., YouTube) and can be easily collected using a set of topic labels as search keywords. Additionally, since these web videos have been edited by a large community of users, the risk of building a biased summarization model is significantly reduced. Several existing works [1, 21] have explored different strategies to exploit the semantic relatedness between web videos and benchmark videos. So motivated, we aim to effectively utilize a large collection of weakly-labelled web videos to learn more accurate and informative video representations which: (i) preserve the essential information within the raw videos; and (ii) contain discriminative information regarding the semantic consistency with web videos. This calls for deep generative models that capture the underlying latent variables and make practical use of web data and benchmark data to learn abstract, high-level representations.

To this end, we present a generative framework for summarizing videos, illustrated in Fig. 1. The basic architecture consists of two components: a variational autoencoder (VAE) [14] model for learning the latent semantics from web videos, and a sequence encoder-decoder with an attention mechanism for summarization. The role of the VAE is to map videos into a continuous latent variable via an inference network (encoder), and then use the generative network (decoder) to reconstruct the input videos conditioned on samples from the latent variable. For the summarization component, the association is temporally ambiguous since only a subset of fragments in the raw video is relevant to its summary semantics. To filter out the irrelevant fragments and identify informative temporal regions for better summary generation, we exploit a soft attention mechanism in which the attention vectors (i.e., context representations) of raw videos are obtained by integrating the latent semantics trained from web videos. Furthermore, we introduce a weakly-supervised semantic matching loss, instead of a reconstruction loss, to learn topic-associated summaries in our generative framework. In this sense, we take advantage of a potentially more accurate and flexible latent variable distribution from external data, thus strengthening the expressiveness of the generated summary in the encoder-decoder based summarization model. To evaluate the effectiveness of the proposed method, we conduct comprehensive experiments under different training settings and demonstrate that our method with web videos achieves significantly better performance than competitive video summarization approaches.

Fig. 1.

An illustration of the proposed generative framework for video summarization. A VAE model is pre-trained on web videos (purple dashed rectangle area); the summarization is implemented within an encoder-decoder paradigm using both the attention vector and the sampled latent variable from the VAE (red dashed rectangle area). (Color figure online)

2 Related Work

Video Summarization is a challenging task which has been explored for many years [18, 37]; existing methods can be grouped into two broad categories: unsupervised and supervised learning methods. Unsupervised summarization methods focus on low-level visual cues to locate the important segments of a video. Various strategies have been investigated, including clustering [7, 8], sparse optimization [3, 22], and energy minimization [4, 25]. A majority of recent works study summarization solutions based on supervised learning from human annotations. For instance, to make a large-margin structured prediction, submodular functions are trained with human-annotated summaries [9]. Gygli et al. [8] propose a linear regression model to estimate the interestingness score of shots. Gong et al. [5] and Sharghi et al. [28] learn from user-created summaries to select informative video subsets. Zhang et al. [43] show that summary structures can be transferred between videos that are semantically consistent. More recently, DNN-based methods have been applied to video summarization with the help of a pairwise deep ranking model [42] or recurrent neural networks (RNNs) [44]. However, these approaches assume the availability of a large number of human-created video-summary pairs or fine-grained temporal annotations, which are in practice difficult and expensive to acquire. Alternatively, there have been attempts to leverage information from other data sources such as web images, GIFs and texts [10, 13, 23]. Chu et al. [1] propose to summarize shots that co-occur among multiple videos of the same topic. Panda et al. [20] present an end-to-end 3D convolutional neural network (CNN) architecture to learn a summarization model with web videos. In this paper, we also consider using the topic-specific cues in web videos for better summarization, but adopt a generative summarization framework to exploit the complementary benefits of web videos.

Video Highlight Detection is highly related to video summarization, and many earlier approaches have primarily focused on specific data scenarios such as broadcast sport videos [27, 35]. Traditional methods usually adopt mid-level and high-level audio-visual features, exploiting the well-defined structures of such videos. For general highlight detection, Sun et al. [32] employ a latent SVM model to detect highlights by learning from pairs of raw and edited videos. DNNs have also achieved large performance improvements and shown great promise in highlight detection [41]. However, most of these methods treat highlight detection as a binary classification problem, while highlight labelling is usually ambiguous for humans. This also imposes a heavy burden on humans to collect a huge amount of labelled data for training DNN-based models.

Deep Generative Models are very powerful in learning complex data distributions and low-dimensional latent representations. Moreover, generative modelling for video summarization may provide an effective way to bring scalability and stability when training on a large amount of web data. Two of the most effective approaches are the VAE [14] and the generative adversarial network (GAN) [6]. The VAE maximizes a variational lower bound on the observation likelihood while encouraging the variational posterior distribution of the latent variables to be close to the prior distribution. A GAN is composed of a generative model and a discriminative model trained in a min-max game framework. Both VAE and GAN have already shown promising results in image/frame generation tasks [17, 26, 38]. To embrace temporal structures in generative modelling, we propose a new variational sequence-to-sequence encoder-decoder framework for video summarization that captures both the video-level topics and a web semantic prior. The attention mechanism embedded in our framework can naturally be used for key-shot selection for summarization. Most related to our generative summarization is the work of Mahasseni et al. [16], who present an unsupervised summarization method in the framework of GAN. However, the attention mechanism in their approach depends solely on the raw video itself and is thus limited in delivering diverse contents in video-summary reconstruction.

3 The Proposed Framework

As an intermediate step toward leveraging abundant user-edited videos on the Web to assist the training of our generative video summarization framework, in this section we first introduce the basic building blocks of the proposed framework, called variational encoder-summarizer-decoder (VESD). The VESD consists of three components: (i) an encoder RNN for the raw video; (ii) an attention-based summarizer for the raw video; (iii) a decoder RNN for the summary video.

Following the video summarization pipelines in previous methods [24, 44], we first perform temporal segmentation and shot-level feature extraction for raw videos using CNNs. Each video \(\mathcal {X}\) is then treated as a sequential set of multiple non-uniform shots, where \(\varvec{x}_{t}\) is the feature vector of the t-th shot in the video representation \(\varvec{X}\). Most supervised summarization approaches aim to predict labels/scores which indicate whether each shot should be included in the summary; however, they suffer from the drawback of selecting redundant visual contents. For this reason, we formulate video summarization as a video generation task, in which the summary representation \(\varvec{Y}\) is not necessarily restricted to a subset of \(\varvec{X}\). In this manner, our method centres on the semantic essence of a video and can exhibit high tolerance for summaries with visual differences. Following the encoder-decoder paradigm [33], our summarization framework is composed of two parts: the encoder-summarizer is an inference network \(q_{\varvec{\phi }}(\varvec{a}|\varvec{X},\varvec{z})\) that takes both the video representation \(\varvec{X}\) and the latent variable \(\varvec{z}\) (sampled from the VAE module pre-trained on web videos) as inputs, and is expected to generate the video content representation \(\varvec{a}\) that captures all the information about \(\varvec{Y}\). The summarizer-decoder is a generative network \(p_{\varvec{\theta }}(\varvec{Y}|\varvec{a},\varvec{z})\) that outputs the summary representation \(\varvec{Y}\) based on the attention vector \(\varvec{a}\) and the latent representation \(\varvec{z}\).
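To make the preprocessing concrete, the following is a minimal PyTorch sketch of the shot-level feature extraction described above: frame-level CNN features are averaged within each temporal segment. The function name and tensor layout are our own illustrative choices, not code released by the authors.

```python
import torch

def shot_features(frame_feats, shot_boundaries):
    """Average frame-level CNN features within each shot.

    frame_feats:      (num_frames, feat_dim) tensor, e.g. GoogLeNet pool5 outputs
    shot_boundaries:  list of (start, end) frame indices from temporal segmentation
    returns:          (num_shots, feat_dim) tensor of shot-level features x_t
    """
    return torch.stack([frame_feats[s:e].mean(dim=0) for s, e in shot_boundaries])

# toy usage: a 300-frame video with 1024-d features split into three shots
x = shot_features(torch.randn(300, 1024), [(0, 100), (100, 220), (220, 300)])
```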

3.1 Encoder-Summarizer

To date, modelling sequence data with RNNs has proven successful in video summarization [44]. Therefore, for the encoder-summarizer component, we employ a pointer RNN, e.g., a bidirectional Long Short-Term Memory (LSTM), as an encoder that processes the raw video, and a summarizer that aims to select the shots most likely to contain salient information. The summarizer is exactly the attention-based model that generates the video context representation by attending to the encoded video features.

At time step t, we denote \(\varvec{x}_{t}\) as the feature vector of the t-th shot and \(\varvec{h}_{t}^{e}\) as the state output of the encoder, where \(\varvec{h}_{t}^{e}\) is obtained by concatenating the hidden states from the two directions:

$$\begin{aligned} \varvec{h}_{t}^{e}=[\text {RNN}_{\overrightarrow{enc}}(\overrightarrow{\varvec{h}_{t-1}},\varvec{x}_{t});\text {RNN}_{\overleftarrow{enc}}(\overleftarrow{\varvec{h}_{t+1}},\varvec{x}_{t})]. \end{aligned}$$
(1)

The attention mechanism is proposed to compute an attention vector \(\varvec{a}\) of input sequence by summing the sequence information \(\{\varvec{h}_{t}^{e}, t=1,\dots ,|\varvec{X}|\}\) with the location variable \(\varvec{\alpha }\) as follows:

$$\begin{aligned} \varvec{a}=\sum _{t=1}^{|\varvec{X}|}\alpha _{t}\varvec{h}_{t}^{e}, \end{aligned}$$
(2)

where \(\alpha _{t}\) denotes the t-th value of \(\varvec{\alpha }\) and indicates whether the t-th shot is included in the summary or not. As mentioned in [40], when applying generative modelling to the log-likelihood of the conditional distribution \(p(\varvec{Y}|\varvec{X})\), one approach is to sample the attention vector \(\varvec{a}\) by assigning a Bernoulli distribution to \(\varvec{\alpha }\). However, the resulting Monte Carlo gradient estimator of the variational lower-bound objective requires complicated variance reduction techniques and may lead to unstable training. Instead, we adopt a deterministic approximation to obtain \(\varvec{a}\). That is, we produce an attentive probability distribution based on \(\varvec{X}\) and \(\varvec{z}\), defined as \(\alpha _{t}:=p(\alpha _{t}|\varvec{h}_{t}^{e},\varvec{z})=\text {softmax}(\varphi _{t}([\varvec{h}_{t}^{e};\varvec{z}]))\), where \(\varvec{\varphi }\) is a parameterized potential typically realized by a neural network, e.g., a multilayer perceptron (MLP). Accordingly, the attention vector in Eq. (2) becomes:

$$\begin{aligned} \varvec{a}=\sum _{t=1}^{|\varvec{X}|}p(\alpha _{t}|\varvec{h}_{t}^{e},\varvec{z})\varvec{h}_{t}^{e}, \end{aligned}$$
(3)

which is fed to the decoder RNN for summary generation. The attention mechanism extracts an attention vector \(\varvec{a}\) by iteratively attending to the raw video features based on the latent variable \(\varvec{z}\) learned from web data. In doing so, the model is able to adapt to the ambiguity inherent in summaries and obtain the salient information of the raw video through attention. Intuitively, the attention scores \(\alpha _{t}\) are used to perform shot selection for summarization.
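For concreteness, below is a minimal PyTorch sketch of the encoder-summarizer of Eqs. (1)-(3): a bidirectional LSTM encodes the shot features, and a small MLP conditioned on the web-derived latent variable z produces the softmax attention scores. Module names and layer sizes are illustrative assumptions loosely following Sect. 5, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EncoderSummarizer(nn.Module):
    """Bidirectional LSTM encoder + z-conditioned soft attention (Eqs. 1-3)."""

    def __init__(self, feat_dim=1024, hidden_dim=1024, z_dim=256):
        super().__init__()
        # RNN_enc: forward and backward states are concatenated -> 2 * hidden_dim
        self.encoder = nn.LSTM(feat_dim, hidden_dim, num_layers=2,
                               batch_first=True, bidirectional=True)
        # varphi([h_t; z]): a small MLP producing one attention logit per shot
        self.attn_mlp = nn.Sequential(
            nn.Linear(2 * hidden_dim + z_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1))

    def forward(self, x, z):
        # x: (batch, num_shots, feat_dim) shot-level CNN features
        # z: (batch, z_dim) latent variable sampled from the web-video VAE
        h, _ = self.encoder(x)                            # (batch, T, 2*hidden_dim)
        z_tiled = z.unsqueeze(1).expand(-1, h.size(1), -1)
        logits = self.attn_mlp(torch.cat([h, z_tiled], dim=-1)).squeeze(-1)
        alpha = F.softmax(logits, dim=1)                  # attention scores over shots
        a = torch.bmm(alpha.unsqueeze(1), h).squeeze(1)   # attention vector a (Eq. 3)
        return a, alpha

# toy usage: one video with 20 shots of 1024-d features and a 256-d latent sample
enc = EncoderSummarizer()
a, alpha = enc(torch.randn(1, 20, 1024), torch.randn(1, 256))
```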

3.2 Summarizer-Decoder

We specify the summary generation process as \(p_{\varvec{\theta }}(\varvec{Y}|\varvec{a},\varvec{z})\), the conditional likelihood of the summary given the attention vector \(\varvec{a}\) and the latent variable \(\varvec{z}\). Unlike the standard Gaussian prior adopted in the VAE, \(p(\varvec{z})\) in our framework is pre-trained on web videos to regularize the latent semantic representations of summaries. Therefore, the summaries generated via \(p_{\varvec{\theta }}(\varvec{Y}|\varvec{a},\varvec{z})\) are likely to possess diverse contents. In this manner, \(p_{\varvec{\theta }}(\varvec{Y}|\varvec{a},\varvec{z})\) is reconstructed via an RNN decoder at each time step t: \(p_{\varvec{\theta }}(\varvec{y}_{t}|\varvec{a},[\varvec{\mu }_{\varvec{z}},\varvec{\sigma }^{2}_{\varvec{z}}])\), where \(\varvec{\mu }_{\varvec{z}}\) and \(\varvec{\sigma }_{\varvec{z}}\) are nonlinear functions of the latent variables specified by two learnable neural networks (detailed in Sect. 4).
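One possible rendering of the summarizer-decoder as code is sketched below: an LSTM whose initial state is computed from the attention vector a together with \([\varvec{\mu }_{\varvec{z}},\log \varvec{\sigma }^{2}_{\varvec{z}}]\), regressing one summary-shot feature per step. The exact conditioning scheme and all names are our assumptions; the paper only specifies an RNN decoder conditioned on a and the Gaussian statistics of z.

```python
import torch
import torch.nn as nn

class SummarizerDecoder(nn.Module):
    """LSTM decoder p_theta(y_t | a, [mu_z, sigma_z^2]) sketched as feature regression."""

    def __init__(self, feat_dim=1024, ctx_dim=2048, z_dim=256, hidden_dim=1024):
        super().__init__()
        # map the conditioning information (a, mu_z, log sigma_z^2) to the initial state
        self.init_state = nn.Linear(ctx_dim + 2 * z_dim, hidden_dim)
        self.decoder = nn.LSTM(feat_dim, hidden_dim, num_layers=2, batch_first=True)
        self.readout = nn.Linear(hidden_dim, feat_dim)

    def forward(self, y_in, a, mu_z, logvar_z):
        # y_in: (batch, T_s, feat_dim) previous summary-shot features (teacher forcing)
        h0 = torch.tanh(self.init_state(torch.cat([a, mu_z, logvar_z], dim=-1)))
        h0 = h0.unsqueeze(0).repeat(2, 1, 1)     # one initial state per LSTM layer
        c0 = torch.zeros_like(h0)
        out, _ = self.decoder(y_in, (h0, c0))
        return self.readout(out)                 # predicted summary-shot features

dec = SummarizerDecoder()
y_hat = dec(torch.randn(1, 5, 1024), torch.randn(1, 2048),
            torch.randn(1, 256), torch.randn(1, 256))
```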

3.3 Variational Inference

Given the proposed VESD model, the network parameters \(\{\varvec{\phi },\varvec{\theta }\}\) need to be updated during inference. We marginalize over the latent variables \(\varvec{a}\) and \(\varvec{z}\) by maximizing the following variational lower bound \(\mathcal {L}(\varvec{\phi },\varvec{\theta })\):

$$\begin{aligned} \mathcal {L}(\varvec{\phi },\varvec{\theta })=\mathbb {E}_{q_{\varvec{\phi }}(\varvec{a},\varvec{z}|\varvec{X},\varvec{Y})}[\log p_{\varvec{\theta }}(\varvec{Y}|\varvec{a},\varvec{z})]-\text {KL}(q_{\varvec{\phi }}(\varvec{a},\varvec{z}|\varvec{X},\varvec{Y})||p(\varvec{a},\varvec{z})), \end{aligned}$$
(4)

where \(\text {KL}(\cdot ||\cdot )\) is the Kullback-Leibler divergence. We assume the joint distribution of the latent variables \(\varvec{a}\) and \(\varvec{z}\) has a factorized form, i.e., \(q_{\varvec{\phi }}(\varvec{a},\varvec{z}|\varvec{X},\varvec{Y})=q_{\varvec{\phi }^{(\varvec{z})}}(\varvec{z}|\varvec{X},\varvec{Y})q_{\varvec{\phi }^{(\varvec{a})}}(\varvec{a}|\varvec{X},\varvec{Y})\), and note that \(p(\varvec{a})=q_{\varvec{\phi }^{(\varvec{a})}}(\varvec{a}|\varvec{X},\varvec{Y})\) is defined in a deterministic manner in Sect. 3.1. Therefore the variational objective in Eq. (4) can be derived as:

$$\begin{aligned} \mathcal {L}(\varvec{\phi },\varvec{\theta })&=\mathbb {E}_{q_{\varvec{\phi }^{(\varvec{z})}}(\varvec{z}|\varvec{X},\varvec{Y})}[\mathbb {E}_{q_{\varvec{\phi }^{(\varvec{a})}}(\varvec{a}|\varvec{X},\varvec{Y})}\log p_{\varvec{\theta }}(\varvec{Y}|\varvec{a},\varvec{z})\nonumber \\&\quad -\text {KL}(q_{\varvec{\phi }^{(\varvec{a})}}(\varvec{a}|\varvec{X},\varvec{Y})||p(\varvec{a}))]-\text {KL}(q_{\varvec{\phi }^{(\varvec{z})}}(\varvec{z}|\varvec{X},\varvec{Y})||p(\varvec{z}))\nonumber \\&=\mathbb {E}_{q_{\varvec{\phi }}(\varvec{z}|\varvec{X},\varvec{Y})}[\log p_{\varvec{\theta }}(\varvec{Y}|\varvec{a},\varvec{z})]-\text {KL}(q_{\varvec{\phi }}(\varvec{z}|\varvec{X},\varvec{Y})||p(\varvec{z})). \end{aligned}$$
(5)

The above variational lower bound offers a new perspective on exploiting the reciprocal nature of a raw video and its summary. Maximizing Eq. (5) strikes a balance between minimizing the generation error and minimizing the KL divergence between the approximate posterior \(q_{\varvec{\phi }^{(\varvec{z})}}(\varvec{z}|\varvec{X},\varvec{Y})\) and the prior \(p(\varvec{z})\).
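Because both the approximate posterior and the prior introduced later (Sect. 4.1) are diagonal Gaussians, the KL term in Eq. (5) has a closed form. A minimal sketch of the lower bound using torch.distributions follows, with the reconstruction term passed in as a precomputed value; this illustrates the objective and is not the authors' training code.

```python
import torch
from torch.distributions import Normal, kl_divergence

def elbo(recon_log_prob, mu_q, logvar_q, mu_p, logvar_p):
    """Variational lower bound of Eq. (5): E_q[log p(Y|a,z)] - KL(q(z|.) || p(z))."""
    q = Normal(mu_q, (0.5 * logvar_q).exp())   # approximate posterior q_phi(z|X,Y)
    p = Normal(mu_p, (0.5 * logvar_p).exp())   # prior over z (learned in Sect. 4.1)
    kl = kl_divergence(q, p).sum(dim=-1)       # sum over latent dimensions
    return recon_log_prob - kl                 # quantity to be maximized

# toy usage with a 256-d latent space (identical q and p, so the KL term is zero)
mu_q, logvar_q = torch.zeros(1, 256), torch.zeros(1, 256)
mu_p, logvar_p = torch.zeros(1, 256), torch.zeros(1, 256)
print(elbo(torch.tensor([0.0]), mu_q, logvar_q, mu_p, logvar_p))
```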

4 Weakly-Supervised VESD

In practice, since only a few video-summary pairs are available, the latent variable \(\varvec{z}\) cannot accurately characterize the semantics inherent in a video and its summary. Motivated by the VAE/GAN model [15], we explore a weakly-supervised learning framework and endow our VESD with the ability to make use of rich web videos for latent semantic inference. The VAE/GAN model extends the VAE with the discriminator network of a GAN, constructing the latent space from the inference network of the data rather than from random noise and implicitly learning a rich similarity metric for the data. A similar idea has also been investigated in [16] for unsupervised video summarization. Recall that the discriminator in a GAN tries to distinguish generated examples from real examples. Following the same spirit, we apply a discriminator in the proposed VESD, which naturally results in minimizing the following adversarial loss function:

$$\begin{aligned} \mathcal {L}(\varvec{\phi },\varvec{\theta },\varvec{\psi })=-\mathbb {E}_{\varvec{\hat{Y}}}[\log \text {D}_{\varvec{\psi }}(\varvec{\hat{Y}})]-\mathbb {E}_{\varvec{X},\varvec{z}}[\log (1-\text {D}_{\varvec{\psi }}(\varvec{Y}))], \end{aligned}$$
(6)

where \(\varvec{\hat{Y}}\) refers to the representation of a web video. Unfortunately, the above loss function suffers from the unstable training of standard GAN models and cannot be directly extended to the supervised scenario. To address these problems, we propose to employ a semantic feature matching loss for the weakly-supervised setting of the VESD framework. This objective requires the representation of the generated summary to match the representation of web videos under a similarity function. For the prediction of semantic similarity, we replace \(p_{\varvec{\theta }}(\varvec{Y}|\varvec{a},\varvec{z})\) with the following sigmoid function:

$$\begin{aligned} p_{\varvec{\theta }}(c|\varvec{a},\varvec{h}^{d}(\hat{\varvec{Y}}))=\sigma (\varvec{a}^{T}\varvec{M}\varvec{h}^{d}(\hat{\varvec{Y}})), \end{aligned}$$
(7)

where \(\varvec{h}^{d}(\hat{\varvec{Y}})\) is the last output state of \(\hat{\varvec{Y}}\) in the decoder RNN and \(\varvec{M}\) is a learnable parameter matrix of the sigmoid score. We randomly pick \(\varvec{\hat{Y}}\) from the web videos, and c is the pair-relatedness label, i.e., \(c=1\) if \(\varvec{Y}\) and \(\hat{\varvec{Y}}\) are semantically matched. We can also generalize the above matching loss to the multi-label case by replacing c with a one-hot vector \(\varvec{c}\) whose nonzero position corresponds to the matched label. Therefore, the objective (5) can be rewritten as:

$$\begin{aligned} \mathcal {L}(\varvec{\phi },\varvec{\theta },\varvec{\psi })=\mathbb {E}_{q_{\varvec{\phi }}(\varvec{z})}[\log p_{\varvec{\theta }}(\varvec{c}|\varvec{a},\varvec{h}^{d}(\hat{\varvec{Y}}))]-\text {KL}(q_{\varvec{\phi }}(\varvec{z})||p(\varvec{z}|\hat{\varvec{Y}})). \end{aligned}$$
(8)

The above variational objective shares similarities with the conditional VAE (CVAE) [30], which is able to produce diverse outputs for a single input. For example, Walker et al. [39] use a fully convolutional CVAE for diverse motion prediction from a static image. Zhou and Berg [45] generate diverse time-lapse videos by incorporating conditional, two-stack and recurrent architecture modifications to standard generative models. Therefore, our weakly-supervised VESD naturally embeds diversity in video summary generation.
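A minimal sketch of the semantic matching term of Eqs. (7)-(8) follows: a bilinear form (the matrix M) compares the attention vector of the raw video with the last decoder state of a sampled web video, and a binary cross-entropy is applied over the relatedness label c. Dimensions and the use of a BCE-with-logits loss are our assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class SemanticMatching(nn.Module):
    """p(c=1 | a, h^d(Y_hat)) = sigmoid(a^T M h^d(Y_hat))  (Eq. 7)."""

    def __init__(self, ctx_dim=2048, dec_dim=1024):
        super().__init__()
        self.bilinear = nn.Bilinear(ctx_dim, dec_dim, 1, bias=False)  # the matrix M

    def forward(self, a, h_dec, c):
        # a:     (batch, ctx_dim) attention vector of the raw video
        # h_dec: (batch, dec_dim) last decoder state for a sampled web video
        # c:     (batch,) 1 if the pair is topic-matched, else 0
        logit = self.bilinear(a, h_dec).squeeze(-1)
        return nn.functional.binary_cross_entropy_with_logits(logit, c.float())

match = SemanticMatching()
loss = match(torch.randn(4, 2048), torch.randn(4, 1024), torch.tensor([1, 0, 1, 0]))
```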

4.1 Learnable Prior and Posterior

In contrast to the standard VAE prior that assumes the latent variable \(\varvec{z}\) is drawn from an isotropic Gaussian (e.g., \(p(\varvec{z})=\mathcal {N}(\varvec{0},\varvec{I})\)), we impose a prior distribution learned from web videos, which captures the topic-specific semantics more accurately. Thus we let \(\varvec{z}\) be drawn from the Gaussian \(p(\varvec{z}|\hat{\varvec{Y}})=\mathcal {N}(\varvec{z}|\varvec{\mu }(\hat{\varvec{Y}}),\varvec{\sigma }^{2}(\hat{\varvec{Y}})\varvec{I})\), whose mean and variance are defined as:

$$\begin{aligned} \varvec{\mu }(\hat{\varvec{Y}})=f_{\varvec{\mu }}(\hat{\varvec{Y}}), \text {log}\varvec{\sigma }^{2}(\hat{\varvec{Y}})=f_{\varvec{\sigma }}(\hat{\varvec{Y}}), \end{aligned}$$
(9)

where \(f_{\varvec{\mu }}(\cdot )\) and \(f_{\varvec{\sigma }}(\cdot )\) denote any type of neural networks that are suitable for the observed data. We adopt two-layer MLPs with ReLU activation in our implementation.
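The learnable prior of Eq. (9) can be sketched as follows: two two-layer MLPs with ReLU map a (pooled) web-video representation to the mean and log-variance of a Gaussian, from which z is drawn with the reparameterization trick. The pooling of \(\hat{\varvec{Y}}\) into a single vector and all dimensions are illustrative assumptions; the posterior of Eq. (10) below takes the same form with \([\varvec{a};\varvec{h}^{d}(\hat{\varvec{Y}});\varvec{c}]\) as input.

```python
import torch
import torch.nn as nn

class LearnablePrior(nn.Module):
    """p(z | Y_hat) = N(mu(Y_hat), sigma^2(Y_hat) I) with two-layer MLPs (Eq. 9)."""

    def __init__(self, in_dim=1024, hidden_dim=256, z_dim=256):
        super().__init__()
        def mlp():
            return nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, z_dim))
        self.f_mu, self.f_logvar = mlp(), mlp()

    def forward(self, y_hat):
        # y_hat: (batch, in_dim) pooled representation of a web video
        mu, logvar = self.f_mu(y_hat), self.f_logvar(y_hat)
        z = mu + (0.5 * logvar).exp() * torch.randn_like(mu)  # reparameterized sample
        return z, mu, logvar

prior = LearnablePrior()
z, mu, logvar = prior(torch.randn(2, 1024))
```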

Likewise, we model the posterior \(q_{\varvec{\phi }}(\varvec{z}|\cdot ):=q_{\varvec{\phi }}(\varvec{z}|\varvec{X},\hat{\varvec{Y}},\varvec{c})\) with the Gaussian distribution \(\mathcal {N}(\varvec{z}|\varvec{\mu }(\varvec{X},\hat{\varvec{Y}},\varvec{c}),\varvec{\sigma }^{2}(\varvec{X},\hat{\varvec{Y}},\varvec{c})\varvec{I})\), whose mean and variance are also characterized by two-layer MLPs with ReLU activation:

$$\begin{aligned} \varvec{\mu }=f_{\varvec{\mu }}([\varvec{a};\varvec{h}^{d}(\hat{\varvec{Y}});\varvec{c}]), \text {log}\varvec{\sigma }^{2}=f_{\varvec{\sigma }}([\varvec{a};\varvec{h}^{d}(\hat{\varvec{Y}});\varvec{c}]). \end{aligned}$$
(10)
Fig. 2.

The variational formulation of our weakly-supervised VESD framework.

4.2 Mixed Training Objective Function

One potential issue with the purely weakly-supervised VESD training objective (8) is that the semantic matching loss usually results in summaries focusing on very few shots of the raw video. To ensure the diversity and fidelity of the generated summaries, we can also make use of the importance scores in the partially finely-annotated benchmark datasets, which consistently improves performance. For these detailed annotations, we adopt the same keyframe regularizer as in [16] to measure the cross-entropy loss between the normalized ground-truth importance scores \(\varvec{\alpha }_{\varvec{X}}^{gt}\) and the output attention scores \(\varvec{\alpha }_{\varvec{X}}\):

$$\begin{aligned} \mathcal {L}_{\text {score}}=\text {cross-entropy}(\varvec{\alpha }_{\varvec{X}}^{gt},\varvec{\alpha }_{\varvec{X}}). \end{aligned}$$
(11)

Accordingly, we train the regularized VESD using the following objective function to utilize different levels of annotations:

$$\begin{aligned} \mathcal {L}_{\text {mixed}}=\mathcal {L}(\varvec{\phi },\varvec{\theta },\varvec{\psi },\varvec{\omega })+\lambda \mathcal {L}_{\text {score}}. \end{aligned}$$
(12)

The overall objective can be trained efficiently using back-propagation and is illustrated in Fig. 2. After training, we calculate the salience scores \(\varvec{\alpha }\) for each new video by a forward pass through the summarization model of VESD.
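Assembling the terms, a minimal sketch of the mixed objective of Eq. (12) is given below, with the weakly-supervised part written as a loss (the negative of Eq. (8)) and the keyframe regularizer of Eq. (11) implemented as a cross-entropy between normalized score distributions; the normalization and the 1e-8 stabilizer are our assumptions.

```python
import torch

def mixed_loss(neg_match_loglik, kl, alpha_pred, alpha_gt, lam=0.2):
    """L_mixed sketch (Eq. 12): weakly-supervised term plus lambda * score regularizer.

    neg_match_loglik: negative matching log-likelihood from Eq. (8)
    kl:               KL(q(z|.) || p(z|Y_hat)) term from Eq. (8)
    alpha_pred/gt:    predicted and ground-truth shot importance scores, (batch, T)
    """
    alpha_gt = alpha_gt / alpha_gt.sum(dim=1, keepdim=True)   # normalize to a distribution
    score_ce = -(alpha_gt * torch.log(alpha_pred + 1e-8)).sum(dim=1).mean()
    return neg_match_loglik + kl + lam * score_ce
```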

5 Experimental Results

Datasets and Evaluation. We test our VESD framework on two publicly available video summarization benchmark datasets, CoSum [1] and TVSum [31]. The CoSum dataset [1] consists of 51 videos covering 10 topics: Base Jumping (BJ), Bike Polo (BP), Eiffel Tower (ET), Excavators River Cross (ERC), Kids Playing in leaves (KP), MLB, NFL, Notre Dame Cathedral (NDC), Statue of Liberty (SL) and SurFing (SF). The TVSum dataset [31] contains 50 videos organized into 10 topics from the TRECVid Multimedia Event Detection task [29]: changing Vehicle Tire (VT), getting Vehicle Unstuck (VU), Grooming an Animal (GA), Making Sandwich (MS), ParKour (PK), PaRade (PR), Flash Mob gathering (FM), BeeKeeping (BK), attempting Bike Tricks (BT), and Dog Show (DS). Following the literature [9, 44], we randomly choose 80% of the videos for training and use the remaining 20% for testing on both datasets. As recommended by [1, 20, 21], we evaluate the quality of a generated summary by comparing it to the multiple user-annotated summaries provided in the benchmarks. Specifically, we compute the pairwise average precision (AP) between a proposed summary and each of its corresponding human-annotated summaries and report the mean value; we then average over all videos to obtain the overall performance on a dataset. For the CoSum dataset, we follow [20, 21] and compare each generated summary with three human-created summaries. For the TVSum dataset, we first average the frame-level importance scores to compute shot-level scores and select the top 50% of shots of each video as the human-created summary; each generated summary is then compared with twenty human-created summaries. The top-5 and top-15 mAP performance on both datasets is reported in our evaluation.
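The per-pair evaluation step can be sketched as below: the salience scores predicted for one video are scored against each human-created summary with average precision, and the values are averaged. This is a simplified reading of the protocol in [1, 20, 21] and omits the top-5/top-15 truncation details; it is not the official evaluation script.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def mean_pairwise_ap(pred_scores, human_summaries):
    """Average precision of predicted shot scores against each human summary, averaged.

    pred_scores:      (num_shots,) salience scores from the summarizer
    human_summaries:  list of (num_shots,) binary vectors, one per annotator
    """
    aps = [average_precision_score(gt, pred_scores) for gt in human_summaries]
    return float(np.mean(aps))

# toy usage: 5 shots, 2 annotators
scores = np.array([0.9, 0.1, 0.8, 0.3, 0.7])
humans = [np.array([1, 0, 1, 0, 0]), np.array([1, 0, 0, 0, 1])]
print(mean_pairwise_ap(scores, humans))
```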

Web Video Collection. This section describes the details of web video collection for our approach. We treat the topic labels in both datasets as query keywords and retrieve videos from YouTube for all twenty topic categories. We limit the videos by time duration (less than 4 min) and rank them by relevance to construct a set of weakly-annotated videos. However, these downloaded videos are still lengthy and noisy in general, since they contain a proportion of frames that are irrelevant to the search keywords. Therefore, we introduce a simple but efficient strategy to filter out the noisy parts of these web videos: (1) we first adopt the existing temporal segmentation technique KTS [24] to segment both the benchmark videos and the web videos into non-overlapping shots, and utilize CNNs to extract a feature for each shot; (2) the shot features of the benchmark videos are then used to train an MLP with their topic labels (shots that do not belong to any topic label are assigned a background label), which predicts labels for the shots of the web videos; (3) we further truncate the web videos by keeping the relevant shots whose topic-related probability is larger than a threshold, as sketched below. In this way, we observe that the trimmed videos are sufficiently clean and informative for learning the latent semantics in our VAE module.
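A minimal sketch of filtering step (3), assuming shot-level features and the topic classifier from step (2) are already available; the threshold value and all names are illustrative choices rather than the authors' settings.

```python
import torch

@torch.no_grad()
def truncate_web_video(shot_feats, topic_mlp, topic_id, thresh=0.5):
    """Keep only the shots of a web video whose predicted topic probability exceeds thresh.

    shot_feats: (num_shots, feat_dim) CNN features of KTS-segmented shots
    topic_mlp:  classifier over topic labels + background, trained on benchmark shots
    topic_id:   index of the query topic used to retrieve this video
    """
    probs = torch.softmax(topic_mlp(shot_feats), dim=1)[:, topic_id]
    keep = probs > thresh
    return shot_feats[keep], keep

# toy usage: 30 shots of 1024-d features, 21 classes (20 topics + background)
mlp = torch.nn.Linear(1024, 21)
kept, mask = truncate_web_video(torch.randn(30, 1024), mlp, topic_id=3)
```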

Architecture and Implementation Details. For a fair comparison with state-of-the-art methods [16, 44], we use the output of the pool5 layer of GoogLeNet [34] as the frame-level feature. The shot-level feature is then obtained by averaging all frame features within a shot. We first use the features of the segmented shots of web videos to pre-train a VAE module whose latent variable dimension is set to 256. For the encoder-summarizer-decoder, we use a two-layer bidirectional LSTM with 1024 hidden units, a two-layer MLP with [256, 256] hidden units, and a two-layer LSTM with 1024 hidden units for the encoder RNN, attention MLP and decoder RNN, respectively. We train our framework from scratch using stochastic gradient descent with a minibatch size of 20, a momentum of 0.9, and a weight decay of 0.005. The learning rate is initialized to 0.01 and is divided by 10 every 20 epochs (100 epochs in total). The trade-off parameter \(\lambda \) is set to 0.2 in the mixed training objective.
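For reference, the stated optimization settings map onto a standard PyTorch setup as sketched below (a stand-in module replaces the full VESD network); this is our rendering of the hyper-parameters in this section, not released training code.

```python
import torch
import torch.nn as nn

# stand-in for the assembled VESD network, used only to show the optimization schedule
model = nn.Linear(1024, 1024)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=0.005)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)  # lr / 10 every 20 epochs

for epoch in range(100):
    # in the real setup: iterate over minibatches of 20 videos and back-propagate
    # the mixed objective of Eq. (12) with lambda = 0.2; a dummy step is used here
    optimizer.zero_grad()
    loss = model(torch.randn(20, 1024)).pow(2).mean()
    loss.backward()
    optimizer.step()
    scheduler.step()
```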

5.1 Quantitative Results

Exploration Study. To better understand the impact of using web videos and different types of annotations in our method, we analyze the performance under the following six training settings: (1) benchmark datasets with weak supervision (topic labels); (2) benchmark datasets with weak supervision and an extra 30 downloaded videos per topic; (3) benchmark datasets with weak supervision and an extra 60 downloaded videos per topic; (4) benchmark datasets with strong supervision (topic labels and importance scores); (5) benchmark datasets with strong supervision and an extra 30 downloaded videos per topic; and (6) benchmark datasets with strong supervision and an extra 60 downloaded videos per topic. We have the following key observations from Table 1: (1) Training on the benchmark data with only weak topic labels in our VESD framework performs much worse than training with extra web videos or training with detailed importance scores, which demonstrates that our generative summarization model demands a larger amount of annotated data to perform well. (2) More web videos give better results, which clearly demonstrates the benefit of using web videos and the scalability of our generative framework. (3) The big improvement with strong supervision illustrates the positive impact of incorporating available importance scores into the mixed training of our VESD. This is not surprising, since the attention scores are pushed to focus on different fragments of the raw videos in order to be consistent with the ground truth, yielding a summarizer with the diversity property that is important for generating good summaries. We use training setting (5) in the following experimental comparisons.

Table 1. Exploration study on training settings. Numbers show top-5 mAP scores.
Table 2. Performance comparison using different types of features on CoSum dataset. Numbers show top-5 mAP scores averaged over all the videos of the same topic.

Effect of Deep Features. We also investigate the effect of using different types of deep features as the shot representation in the VESD framework, including 2D deep features extracted from GoogLeNet [34] and ResNet101 [11], and 3D deep features extracted from C3D [36]. From Table 2, we have the following observations: (1) ResNet produces better results than GoogLeNet, with a top-5 mAP improvement of 0.012 on the CoSum dataset, which indicates that more powerful visual features still lead to improvements for our method. (2) Comparing 2D GoogLeNet features with C3D features, the C3D features achieve better performance (0.765 vs 0.755) and comparable performance with ResNet101 features. We believe this is because C3D features exploit the temporal information of videos and are thus also well suited for summarization.

Table 3. Experimental results on CoSum dataset. Numbers show top-5/15 mAP scores averaged over all the videos of the same topic.
Table 4. Experimental results on TVSum dataset. Numbers show top-5/15 mAP scores averaged over all the videos of the same topic.

Comparison with Unsupervised Methods. We first compare VESD with several unsupervised methods including SMRS [3], Quasi [13], MBF [1], CVS [21] and SG [16]. Table 3 shows the mean AP for both top-5 and top-15 shots included in the summaries on the CoSum dataset, whereas Table 4 shows the results on the TVSum dataset. We observe the following: (1) Our weakly-supervised approach obtains the highest overall mAP and outperforms the traditional non-DNN based methods SMRS, Quasi, MBF and CVS by large margins. (2) The most competitive DNN-based method, SG [16], gives top-5 mAP that is 3.5% and 1.9% lower than ours on the CoSum and TVSum datasets, respectively. Note that training with web videos alone is already better than training with the multiple handcrafted regularizations proposed in SG. This confirms the effectiveness of incorporating a large number of web videos in our framework and of learning the topic-specific semantics with a weakly-supervised matching loss. (3) Since the CoSum dataset contains videos that share visual concepts with videos from other topics, our approach based on generative modelling naturally yields better results there than on the TVSum dataset. (4) It is worth noting that TVSum is a quite challenging summarization dataset because its topics are very ambiguous and difficult to model well with very few videos. By accessing similar web videos to eliminate the ambiguity of a specific topic, our approach works much better than all the unsupervised methods, achieving a top-5 mAP of 48.1%, showing that accurate and user-interesting video contents can be learned directly from more diverse data rather than from complex summarization criteria.

Comparison with Supervised Methods. We then compare with supervised alternatives including KVS [24], DPP [5], sLstm [44], SM [9] and the weakly-supervised DSN [20]. We have the following key observations from Tables 3 and 4: (1) VESD outperforms KVS on both datasets by a large margin (a maximum improvement of 7.1% in top-5 mAP on CoSum), showing the advantage of our generative modelling and of the more powerful representation learning with web videos. (2) On the CoSum dataset, VESD outperforms SM [9] and DSN [20] by margins of 2.0% and 3.4% in top-5 mAP, respectively. These results suggest that our method is better than both the fully-supervised methods and the weakly-supervised method. (3) On the TVSum dataset, a similar performance gain of 2.0% is achieved over all other supervised methods.

Fig. 3.

Qualitative comparison of video summaries using different training settings, along with the ground-truth importance scores (cyan background). In the last subfigure, we can easily see that weakly-supervised VESD with web videos and available importance scores produces more reliable summaries than training on benchmark videos with only weak labels. (Best viewed in colors) (Color figure online)

5.2 Qualitative Results

To gain some intuition about the different training settings for VESD and their effect on the temporal selection pattern, we visualize selected frames of an example video in Fig. 3. The cyan background shows the frame-level importance scores, and the coloured regions are the subsets of frames selected under each training setting. The visualized keyframes for the different settings support the results presented in Table 1. We notice that all four settings cover the temporal regions with high frame-level scores. By leveraging both the web videos and the importance scores in the datasets, the VESD framework shifts towards the highly topic-specific temporal regions.

6 Conclusion

One key problem in video summarization is how to model the latent semantic representation, which has not been adequately resolved under the “single video understanding” framework of prior works. To address this issue, we introduced a generative summarization framework called VESD, which leverages web videos for better latent semantic modelling and reduces the ambiguity of video summarization in a principled way. We incorporated a flexible web prior distribution into a variational framework and presented a simple encoder-decoder with attention for summarization. The potential of our VESD framework for large-scale video summarization was validated, and extensive experiments on benchmarks showed that VESD significantly outperforms state-of-the-art video summarization methods.