1 Introduction

In recent years, video conferencing has become an essential tool for communication and collaboration across industries and settings. It enables people to connect and work together seamlessly regardless of location, and it contributes to increased productivity, reduced travel costs, and improved work-life balance. However, when a video conference is conducted over a low-bandwidth or high-latency network, the video stream may suffer dropped frames or reduced frame rates, resulting in lower-quality video that seriously degrades the user experience. Video frame interpolation is often used to address this issue: it generates additional frames to fill the gaps between the original frames, producing smoother and more fluid motion. In addition, video conferences are often recorded for later playback or archival purposes; by interpolating one or more frames between every two consecutive frames of the original video, frame interpolation improves smoothness and creates a more seamless playback experience.

Deep learning has enabled the development of various video frame interpolation algorithms, which can be classified into flow-based [1, 9, 12, 13, 15, 22, 29, 34, 38, 43, 45] and kernel-based [3, 6, 23, 31, 32, 36, 37] methods. Flow-based methods estimate the bi-directional optical flow between the input and target frames to warp the input frames, and therefore depend heavily on the accuracy of the estimated flow. For instance, Xu et al. [45] used the optical flow between four frames to obtain quadratic motion parameters through an analytical solution, improving the accuracy of optical flow prediction. However, when there are sudden changes in motion, such as abrupt jerks, the optical flow predicted under the quadratic motion assumption may be inaccurate, resulting in suboptimal interpolated intermediate frames. To address this issue, Dutta et al. [9] used 3D Convolutional Neural Networks (3DCNN) to estimate the motion parameters, further improving the accuracy of the optical flow. However, during video conferences the oral region of participants is often occluded while speaking, and facial and head movements are sometimes large and nonlinear. This makes it difficult to predict optical flow accurately, which degrades the quality of the interpolated intermediate frames. In contrast, kernel-based methods learn convolution kernels to synthesize the intermediate frame. The interpolation accuracy depends on the kernel size, but a larger kernel requires more computation. To solve this issue, [23] predicted offsets and used deformable convolution to enlarge the range of convolution, which significantly improved performance over conventional video frame interpolation approaches. However, video frames differ from single images in that they contain temporal features, i.e., the motion or changes that occur over time, while the feature extraction module in [23] focuses primarily on spatial features. In addition, that feature extraction module consists of Convolutional Neural Networks (CNNs), whose limited receptive field can only capture local features within a certain range of pixels. When a pixel undergoes significant motion, such as a person shaking his head during a video conference, the limitation of CNNs in capturing global information results in poor performance for conference video frame interpolation.

To solve the aforementioned issues, we propose a Spatial-Temporal Deformable Convolution Network (STDC-Net) for video frame interpolation in video conferences. Specifically, STDC-Net first uses an embedding layer to generate shallow spatial-temporal features of the input frames. Then the Spatial-Temporal Feature Learning (STFL) module generates multi-scale deep spatial-temporal features of the input frames. The STFL module is an encoder-decoder architecture: the encoder has four stages, each containing several Spatial-Temporal Feature Extracting (STFE) blocks and a downsample layer. The STFE block is composed of Multi-Layer Perceptrons (MLPs), feedforward neural networks consisting of multiple layers of interconnected neurons; each neuron computes a weighted sum of its inputs, applies an activation function, and passes the result to the next layer, so that successive layers learn increasingly complex representations of the input features. The STFE block splits the spatial-temporal features along horizontal, vertical, and temporal pathways and sends the split features into three different MLPs to learn spatial and temporal features. By splitting the feature maps into chunks of different lengths along the horizontal and vertical pathways at different stages, we can obtain both local and global spatial features. The decoder is realized by 3D deconvolution and 3D convolution layers that upsample the features. Besides, skip connections are used between the encoder and decoder to capture more detailed features. Finally, the Frame Synthesis (FS) module predicts adaptive kernels and offsets to sample pixels in each frame and generate two intermediate frames. In addition, occlusion masks are predicted to weight the two intermediate frames when generating the final intermediate frame. We conduct extensive experiments on the Voxceleb2 [5] and HDTF [48] datasets, and the results demonstrate that our method outperforms state-of-the-art methods.

The main contributions of our work are summarized as follows:

  • We have proposed a novel video frame interpolation algorithm specifically designed for video conferencing, effectively addressing the issues of frame drops and low video frame rates caused by network or bandwidth limitations.

  • We have split features into chunks of different lengths along the horizontal, vertical, and temporal directions at various scales. These split features are then fed into a network that integrates 3DCNN and MLP to extract local and global features, improving offset prediction accuracy and effectively handling large motions.

  • We have conducted a series of comprehensive experiments on Voxceleb2 and HDTF datasets. The results indicate that the proposed method outperforms state-of-the-art methods, achieving the best performance.

The structure of this paper is outlined as follows: Related work is reviewed in Section 2. The proposed algorithm is described in Section 3. The experimental results are discussed in Section 4. Section 5 concludes the paper.

2 Related work

Video frame interpolation aims at interpolating one or more intermediate frames from two consecutive frames while maintaining spatial and temporal consistencies. Existing methods can be mainly classified into two categories: flow-based [10, 11, 14, 16, 21, 25, 26, 27, 28, 30, 44] and kernel-based [2, 7, 8, 19, 24, 33, 42] methods.

The flow-based methods first predict the optical flow between the input frames and the intermediate frame, then forward-warp or backward-warp the input frames or features with this flow to synthesize the intermediate frame. Forward-warping methods use the optical flows \(F_{0 \rightarrow t}\) and \(F_{1 \rightarrow t}\) to warp \(I_0\) and \(I_1\). To improve the accuracy of the estimated optical flow, Bao et al. [1] weighted multi-mapped pixels based on depth information, so that foreground pixels with smaller depth values receive higher weights. For hole positions that no flow vector passes through, they computed the flow by averaging the four neighboring positions with available flows. However, depth estimation is itself a difficult task, and its accuracy greatly impacts the final optical flow. In contrast, Niklaus et al. [30] proposed to allocate weights with a softmax by adding an importance mask Z to obtain more accurate optical flow; however, [30] may not work well for scenes with large motion. Unlike forward warping, backward warping warps the input frames with the optical flows \(F_{t \rightarrow 0}\) and \(F_{t \rightarrow 1}\). For example, [15] linearly combined \(F_{0 \rightarrow 1}\) and \(F_{1 \rightarrow 0}\) to approximate \(F_{t \rightarrow 0}\) and \(F_{t \rightarrow 1}\), which causes errors at motion boundaries. Park et al. [34] assumed that the optical flows \(F_{t \rightarrow 0}\) and \(F_{t \rightarrow 1}\) are symmetric and estimated them directly from a bilateral cost volume; nevertheless, this assumption does not hold in the real world. To solve this issue, Park et al. [35] first estimated the symmetric optical flows \(F_{t \rightarrow 0}\) and \(F_{t \rightarrow 1}\) and then adjusted them to be asymmetric, improving the accuracy of the predicted optical flow.
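For reference, the backward-warping operation used by these methods is commonly implemented with bilinear sampling. The minimal PyTorch sketch below illustrates it; the tensor layout, border padding mode, and flow convention are assumptions of this sketch rather than details taken from any cited method.

```python
import torch
import torch.nn.functional as F

def backward_warp(frame, flow):
    """Backward-warp `frame` (B, C, H, W) with a flow field `flow` (B, 2, H, W).

    flow[:, 0] holds horizontal and flow[:, 1] vertical displacements in pixels,
    i.e. a flow F_{t->i} pointing from the target frame to the source frame.
    """
    b, _, h, w = frame.shape
    # Base sampling grid of pixel coordinates.
    ys, xs = torch.meshgrid(torch.arange(h, device=frame.device),
                            torch.arange(w, device=frame.device), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0)   # (1, 2, H, W)
    coords = grid + flow                                       # displaced coordinates
    # Normalize to [-1, 1] as expected by grid_sample (x first, then y).
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    sample_grid = torch.stack((coords_x, coords_y), dim=-1)    # (B, H, W, 2)
    return F.grid_sample(frame, sample_grid, mode="bilinear",
                         padding_mode="border", align_corners=True)
```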

In video conferences, occlusions and large and nonlinear motions often occur, which makes it challenging to predict accurate optical flow. Since flow-based approaches heavily rely on the accuracy of bi-directional optical flows, these challenging conditions can result in inaccurate optical flow and severely affect the quality of the interpolated frame.

Compared with flow-based approaches, kernel-based methods unify motion perception and target frame generation into a convolution operation. For example, Niklaus et al. [31] proposed a fully convolutional neural network that estimates a 2D spatially-adaptive convolution kernel for each pixel, capturing both the local motion between the input frames and the coefficients used for pixel synthesis. However, this approach is computationally expensive, since estimating large kernels for every pixel results in high memory usage. To reduce the memory demand, Niklaus et al. [32] proposed to separate each 2D convolution kernel into two 1D kernels, but this still cannot handle motions larger than the kernel size. To solve this problem, Lee et al. [23] proposed a video frame interpolation method that extracts spatial features with 2D CNNs to predict weights and offsets, which are used by deformable convolution to enlarge the range of convolution. Nevertheless, CNNs have limitations in extracting global spatial features, and the input video frames also contain temporal features; both factors may degrade the accuracy of the predicted weights and offsets, so [23] cannot interpolate high-quality intermediate frames for conference video. In this paper, we propose a Spatial-Temporal Deformable Convolution Network (STDC-Net) for conference video, which combines CNN with MLP to extract multi-scale spatial-temporal features. The spatial-temporal features are split into chunks of different lengths at various scales to extract local and global spatial features. The multi-scale spatial-temporal features are then used to predict more accurate weights and offsets for generating high-quality intermediate frames. In summary, our method differs from traditional kernel-based video frame interpolation approaches in two aspects: (1) we consider both spatial and temporal features of the input frames; (2) we integrate CNN and MLP to expand the receptive field and extract local-global spatial features, effectively handling large motions.

Fig. 1 The framework of the proposed Spatial-Temporal Deformable Convolution Network (STDC-Net) for video frame interpolation. It first extracts shallow spatial-temporal features with the Embedding Layer. Then, the Spatial-Temporal Feature Learning (STFL) module extracts multi-scale deep spatial-temporal features with Spatial-Temporal Feature Extracting (STFE) blocks. Finally, the Frame Synthesis (FS) module predicts weights, offsets, and masks to generate the intermediate frame

3 Proposed method

The goal of video frame interpolation is to generate an intermediate frame \({I}_t\) at temporal location \(t\in (0,1)\) in between two given consecutive video frames \(I_0\) and \(I_1\). The framework of the proposed method is shown in Fig. 1.

3.1 Framework overview

As shown in Fig. 1, the proposed STDC-Net contains three modules: the embedding layer, the Spatial-Temporal Feature Learning module, and the Frame Synthesis module. First, the embedding layer takes \(I_0\) and \(I_1\) as input and outputs shallow spatial-temporal features; it is realized by a 3D convolution, Batch Normalization (BN), and a ReLU activation. Second, the Spatial-Temporal Feature Learning (STFL) module (Section 3.2) takes the shallow spatial-temporal features as input and outputs multi-scale deep spatial-temporal features. Finally, the Frame Synthesis (FS) module (Section 3.3) uses the extracted spatial-temporal features to predict weights, offsets, and masks for generating the intermediate frame \(I_t\) through deformable convolution.
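For concreteness, a minimal sketch of the embedding layer as described (a 3D convolution followed by Batch Normalization and ReLU) is given below; the kernel size and the 64 output channels are assumptions, as they are not specified here.

```python
import torch.nn as nn

class EmbeddingLayer(nn.Module):
    """Shallow spatial-temporal embedding: Conv3d -> BatchNorm3d -> ReLU.

    The 3x3x3 kernel and the 64 output channels are illustrative assumptions.
    """
    def __init__(self, in_channels=3, out_channels=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm3d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, frames):
        # frames: (B, C, T, H, W) with T = 2 input frames I0 and I1.
        return self.body(frames)
```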

3.2 Spatial-temporal feature learning

Recently, Zhang et al. [46] proposed an MLP-based model for video classification that achieves strong performance. Following this strategy, we use an MLP-based encoder-decoder architecture to learn multi-scale deep spatial-temporal features. As shown in Fig. 1, the encoder is composed of four stages; each stage consists of several Spatial-Temporal Feature Extracting (STFE) blocks and a 3D convolution layer with a stride of 2 that downsamples the input features. Specifically, the four stages contain 4, 3, 8, and 3 STFE blocks, and their output channels are 64, 128, 256, and 512, respectively. The decoder has four layers: the first three are realized by 3D deconvolutions with a stride of 2 that upsample the feature maps, and the last layer is a 3D convolution. Furthermore, to enhance the model's robustness, skip connections implemented by concatenation are added between the encoder and decoder to capture detailed and textured features. Next, we describe the STFE block in more detail.
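The following PyTorch skeleton mirrors the stage layout described above (4, 3, 8, and 3 STFE blocks with output channels 64, 128, 256, and 512; stride-2 3D convolutions for downsampling; three 3D deconvolutions plus a final 3D convolution with concatenation-based skip connections). The kernel sizes, the placement of the downsampling layer at the start of each stage, and spatial-only downsampling are assumptions of this sketch; `stfe_block` stands for the STFE block described next.

```python
import torch
import torch.nn as nn

class STFL(nn.Module):
    """Skeleton of the Spatial-Temporal Feature Learning module (a sketch)."""

    def __init__(self, stfe_block, in_ch=64,
                 depths=(4, 3, 8, 3), channels=(64, 128, 256, 512)):
        super().__init__()
        self.stages = nn.ModuleList()
        prev = in_ch
        for depth, ch in zip(depths, channels):
            self.stages.append(nn.Sequential(
                # Stride-2 downsampling in the spatial dimensions only (assumption).
                nn.Conv3d(prev, ch, kernel_size=3, stride=(1, 2, 2), padding=1),
                *[stfe_block(ch) for _ in range(depth)]))
            prev = ch

        # Decoder: three stride-2 3D deconvolutions and a final 3D convolution.
        def up(cin, cout):
            return nn.ConvTranspose3d(cin, cout, kernel_size=(3, 4, 4),
                                      stride=(1, 2, 2), padding=1)
        self.up3 = up(channels[3], channels[2])
        self.up2 = up(channels[2] * 2, channels[1])
        self.up1 = up(channels[1] * 2, channels[0])
        self.last = nn.Conv3d(channels[0] * 2, channels[0], kernel_size=3, padding=1)

    def forward(self, x):                     # x: (B, 64, T, H, W) from the embedding layer
        skips = []
        for stage in self.stages:
            x = stage(x)
            skips.append(x)                   # keep each stage output for skip connections
        d3 = self.up3(skips[3])
        d2 = self.up2(torch.cat([d3, skips[2]], dim=1))
        d1 = self.up1(torch.cat([d2, skips[1]], dim=1))
        return self.last(torch.cat([d1, skips[0]], dim=1))
```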

Fig. 2 The process of extracting spatial features. The feature maps are first split along the horizontal and vertical pathways, respectively, then sent into MLPs and reshaped back to their original shape. In addition, to reduce computation cost, the feature maps are further split along the channel dimension

The STFE block consists of a Spatial Feature Extracting (SFE) block and a Temporal Feature Extracting (TFE) block, which extract spatial and temporal features, respectively. The SFE block processes the spatial-temporal features of each frame separately along both the horizontal and vertical pathways. As shown in Fig. 2, we first split the input features \(F\in \mathbb {R}^{H \times W \times C}\) (H, W and C are the height, width, and channel number of the feature maps) along the horizontal pathway, where the length of each chunk is L. Following [46], we further split each chunk along the channel dimension to reduce computation cost, so that each chunk has D channels. We thus obtain \(\frac{H \times W \times C}{D \times L}\) chunks \(F_i\in \mathbb {R}^{L \times D}\), send each chunk into an MLP to transform it, and finally reshape all chunks back to the original dimension \(F^h\in \mathbb {R}^{H \times W \times C}\).

We also process the original spatial-temporal features F along the vertical pathway. Similarly, we split F along the vertical direction and further split each chunk along the channel dimension, obtaining \(\frac{H \times W \times C}{D\times L}\) chunks \(F_j\in \mathbb {R}^{L \times D}\). We send them to another MLP and reshape them back to the original dimension \(F^v\in \mathbb {R}^{H \times W \times C}\). Finally, to capture the correlations between channels, we feed the original spatial-temporal features F into an MLP and obtain the feature \(F^c\in \mathbb {R}^{H \times W \times C}\). After obtaining the horizontal features \(F^h\), vertical features \(F^v\), and channel features \(F^c\), we sum them element-wise. The resulting features are then fed into an MLP to obtain weights \(W^h\), \(W^v\), and \(W^c\), which are used to compute the final spatial features \(F^s\).
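A minimal sketch of the SFE block under these definitions is shown below. Flattening each L × D chunk before a fully connected layer and normalizing the branch weights with a softmax are assumptions of this sketch, and H, W, and C are assumed to be divisible by L and D.

```python
import torch
import torch.nn as nn

class SFEBlock(nn.Module):
    """Sketch of the Spatial Feature Extracting block.

    Per-frame features (B, C, H, W) are split into length-L chunks along the
    horizontal and vertical directions and into groups of D channels; each chunk
    is flattened, transformed by a fully connected layer, and the horizontal,
    vertical, and channel branches are fused with learned weights.
    """

    def __init__(self, channels, chunk_len, chunk_dim):
        super().__init__()
        self.L, self.D = chunk_len, chunk_dim
        self.fc_h = nn.Linear(chunk_len * chunk_dim, chunk_len * chunk_dim)
        self.fc_v = nn.Linear(chunk_len * chunk_dim, chunk_len * chunk_dim)
        self.fc_c = nn.Linear(channels, channels)
        self.fc_w = nn.Linear(channels, 3 * channels)   # branch-weight predictor

    def _split_mlp(self, x, fc):
        # x: (B, H, W, C); split the last spatial axis into chunks of length L,
        # the channels into groups of D, flatten each chunk, and transform it.
        b, h, w, c = x.shape
        x = x.reshape(b, h, w // self.L, self.L, c // self.D, self.D)
        x = x.permute(0, 1, 2, 4, 3, 5).reshape(b, h, w // self.L, c // self.D,
                                                self.L * self.D)
        x = fc(x)
        x = x.reshape(b, h, w // self.L, c // self.D, self.L, self.D)
        return x.permute(0, 1, 2, 4, 3, 5).reshape(b, h, w, c)

    def forward(self, x):                       # x: (B, C, H, W)
        x = x.permute(0, 2, 3, 1)               # (B, H, W, C)
        f_h = self._split_mlp(x, self.fc_h)                                   # horizontal
        f_v = self._split_mlp(x.transpose(1, 2), self.fc_v).transpose(1, 2)   # vertical
        f_c = self.fc_c(x)                                                    # channel
        w = self.fc_w(f_h + f_v + f_c).reshape(*x.shape[:3], 3, -1).softmax(dim=3)
        out = w[..., 0, :] * f_h + w[..., 1, :] * f_v + w[..., 2, :] * f_c
        return out.permute(0, 3, 1, 2)          # back to (B, C, H, W)
```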

The encoder of STFL has four stages: in the earlier stages we set L to a smaller value to extract finer local spatial features, while in the later stages we set L to a larger value to extract global features. After the above processing, we obtain the local and global spatial features \(F_0^s\) and \(F_1^s\) of the input frames.

Fig. 3 The process of extracting temporal features. The feature maps are first split along the channel dimension to reduce computation cost, then split per pixel and concatenated along the temporal dimension (red arrow), and finally sent into an MLP and reshaped back to the original shape to obtain the temporal features

As shown in Fig. 3, in order to capture temporal information, the Temporal Feature Extracting (TFE) block concatenates the spatial features \(F_0^s\) and \(F_1^s\) along the temporal dimension as \(F^s\in \mathbb {R}^{H \times W \times T \times C}\). Then, we split \(F^s\) along the channel dimension to reduce computation cost, so that each chunk has D channels. After that, we concatenate features along the temporal dimension to obtain chunks \(F_i^s\in \mathbb {R}^{T \times D}\). Finally, we send these chunks into an MLP and reshape the output back to the original dimension to obtain temporal features \(F^t\in \mathbb {R}^{H \times W \times T \times C}\). Through the TFE and SFE blocks, we obtain the spatial-temporal features of the input frames.
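The TFE block can be sketched analogously; flattening each T × D chunk before a fully connected layer is, again, an assumption of this illustration.

```python
import torch.nn as nn

class TFEBlock(nn.Module):
    """Sketch of the Temporal Feature Extracting block.

    The stacked per-frame features (B, C, T, H, W) are split into groups of D
    channels; each pixel then contributes a (T, D) chunk that is flattened and
    transformed by a fully connected layer, mixing information across time.
    C is assumed to be divisible by D.
    """

    def __init__(self, chunk_dim, num_frames=2):
        super().__init__()
        self.D = chunk_dim
        self.fc_t = nn.Linear(num_frames * chunk_dim, num_frames * chunk_dim)

    def forward(self, x):                           # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        x = x.permute(0, 3, 4, 2, 1)                # (B, H, W, T, C)
        x = x.reshape(b, h, w, t, c // self.D, self.D)
        x = x.permute(0, 1, 2, 4, 3, 5).reshape(b, h, w, c // self.D, t * self.D)
        x = self.fc_t(x)                            # mix information along time
        x = x.reshape(b, h, w, c // self.D, t, self.D)
        x = x.permute(0, 1, 2, 4, 3, 5).reshape(b, h, w, t, c)
        return x.permute(0, 4, 3, 1, 2)             # back to (B, C, T, H, W)
```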

3.3 Frame synthesis

The Frame Synthesis (FS) module uses the spatial-temporal features extracted by the Spatial-Temporal Feature Learning (STFL) module to predict spatially-adaptive kernels for generating the intermediate frame.

Conventional convolution has been employed in video frame interpolation [31, 32]. As shown in Fig. 4, these methods obtain two intermediate frames from \(I_0\) and \(I_1\) by predicting a convolution kernel for each pixel, which can be formulated as,

$$\begin{aligned} \begin{aligned} I_{i\rightarrow t}(x, y)&= \sum ^{N-1}_{k=0} \sum ^{N-1}_{l=0} W_{i,k,l}(x,y)I_i(x+k, y+l) \end{aligned} \end{aligned}$$
(1)

where \(I_{i\rightarrow t}\) is the intermediate frame generated from \(I_i\), N is the kernel size, \(I_i\) is the i-th input frame, \(W_{i,k,l}\) are the kernel weights, and \(\left\{ (k,l)\right\} ^{N-1}_0 = \left\{ (-1,-1),(-1,0),...,(1,1)\right\} \).
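A naive reference implementation of (1) is shown below; storing the per-pixel weights in an N²-channel tensor and using replication padding at the borders are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def adaptive_kernel_synthesis(frame, weights, kernel_size=3):
    """Per-pixel adaptive-kernel synthesis as in (1).

    frame:   (B, C, H, W) input frame I_i
    weights: (B, N*N, H, W) per-pixel kernel weights W_{i,k,l}
    """
    n = kernel_size
    pad = n // 2
    padded = F.pad(frame, (pad, pad, pad, pad), mode="replicate")
    out = torch.zeros_like(frame)
    for k in range(n):          # kernel row index; effective offset is k - pad
        for l in range(n):      # kernel column index; effective offset is l - pad
            shifted = padded[:, :, k:k + frame.shape[2], l:l + frame.shape[3]]
            out = out + weights[:, k * n + l].unsqueeze(1) * shifted
    return out
```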

As the kernel shape is a rectangular grid, it is unable to handle motions larger than the kernel size. As shown in Fig. 4, AdaCof [23] addressed this issue with spatially-adaptive deformable convolution, which can be formulated as,

$$\begin{aligned} \begin{aligned} I_{i\rightarrow t}(x, y)&= \sum ^{N-1}_{k=0} \sum ^{N-1}_{l=0} W_{i,k,l}(x,y)I_i(x+k+\alpha _{k,l}, y+l+\beta _{k,l}) \end{aligned} \end{aligned}$$
(2)

where \(\left\{ \alpha _{k,l},\beta _{k,l}\right\} _0^{N-1}\) is a set of adaptable sampling offsets.
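The sketch below extends the previous one to (2) by shifting every kernel tap with the learned offsets and sampling bilinearly; setting the offsets to zero recovers (1). The offset convention (α horizontal, β vertical) is an assumption of this illustration.

```python
import torch
import torch.nn.functional as F

def deformable_kernel_synthesis(frame, weights, alpha, beta, kernel_size=3):
    """Deformable per-pixel kernel synthesis as in (2).

    frame:        (B, C, H, W)
    weights:      (B, N*N, H, W) per-pixel kernel weights
    alpha, beta:  (B, N*N, H, W) horizontal / vertical offsets in pixels
    """
    b, c, h, w = frame.shape
    n, pad = kernel_size, kernel_size // 2
    ys, xs = torch.meshgrid(torch.arange(h, device=frame.device),
                            torch.arange(w, device=frame.device), indexing="ij")
    out = torch.zeros_like(frame)
    for k in range(n):
        for l in range(n):
            idx = k * n + l
            # Sampling position for this tap: base grid + tap offset + learned offset.
            sx = xs + (l - pad) + alpha[:, idx]
            sy = ys + (k - pad) + beta[:, idx]
            grid = torch.stack((2 * sx / (w - 1) - 1, 2 * sy / (h - 1) - 1), dim=-1)
            sampled = F.grid_sample(frame, grid, mode="bilinear",
                                    padding_mode="border", align_corners=True)
            out = out + weights[:, idx].unsqueeze(1) * sampled
    return out
```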

Fig. 4 The difference between using conventional convolution (a) and using deformable convolution (b) in video frame interpolation

We follow the strategy of AdaCof. As shown in Fig. 1, we first use a 3D convolution layer to further process the spatial-temporal features obtained from the last decoder layer. We reshape the spatial-temporal feature maps as \(F \in \mathbb {R}^{BT \times C \times H \times W}\), where B is the batch size, T is the number of input frames, C is the number of channels, and H and W are the height and width of the feature maps. To expand the convolution field, we predict offsets \((\alpha , \beta )\). Specifically, the offsets learn small displacements for each pixel in the spatial domain, which adaptively adjust the sampling positions of the convolution kernel on the input frames. By learning the offsets, deformable convolution can capture deformation information in the input frames, allowing the convolution operation to better adapt to spatial variations in the target. As shown in Fig. 1, to obtain \(\alpha \) we process the reshaped feature maps with two 2D convolution layers and then apply a 2D transposed convolution so that the resulting maps match the size of the input frames. The process of obtaining \(\beta \) is the same as that of obtaining \(\alpha \). By applying the offsets, we determine the sampled pixels required for synthesizing each target pixel. Next, we process the spatial-temporal features with a network that has the same architecture as the offset estimator and normalize the output with a softmax activation to obtain the weights W. We reshape the predicted weights and offsets as \(F \in \mathbb {R}^{B \times T \times C \times H \times W}\) and split them along the temporal dimension. With the split weights and offsets, we obtain two intermediate frames from the input frames through formula (2).
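A sketch of one prediction branch under this description is given below; the intermediate channel width, kernel sizes, upsampling factor, and the sigmoid on the mask branch are assumptions of the sketch.

```python
import torch
import torch.nn as nn

class PredictionHead(nn.Module):
    """One FS prediction branch: two 2D convolutions followed by a 2D transposed
    convolution that brings the map back to the input-frame resolution.
    """
    def __init__(self, in_ch, out_ch, hidden=64, upscale=2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, out_ch, 3, padding=1),
            nn.ConvTranspose2d(out_ch, out_ch, upscale * 2, stride=upscale,
                               padding=upscale // 2),
        )

    def forward(self, x):
        return self.body(x)

# Usage sketch: feat has shape (B*T, C, H', W'); N is the deformable kernel size.
# alpha_head  = PredictionHead(C, N * N)
# beta_head   = PredictionHead(C, N * N)
# weight_head = PredictionHead(C, N * N)
# mask_head   = PredictionHead(C, 1)
# alpha, beta = alpha_head(feat), beta_head(feat)
# weights = torch.softmax(weight_head(feat), dim=1)   # normalize kernel weights per pixel
# mask = torch.sigmoid(mask_head(feat))               # occlusion mask in [0, 1] (assumption)
```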

Due to occlusions during motion, certain pixels in the generated intermediate frames need to be masked. To implement this, we further predict masks using the same approach as for the weights and split them along the temporal dimension; the final intermediate frame is then obtained from \(I_{0\rightarrow t}\) and \(I_{1\rightarrow t}\) with the masks. It can be written as follows,

$$\begin{aligned} \begin{aligned} I_t&= V_0 \otimes I_{0\rightarrow t} + V_1 \otimes I_{1\rightarrow t} \end{aligned} \end{aligned}$$
(3)

where \(\otimes \) is pixel-wise multiplication. For a target pixel (x, y), \(V_i = 1\) implies that the pixel is visible only in \(I_i\), and \(V_i = 0\) means it is visible only in the other frame.
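A minimal sketch of (3) is given below, under the additional assumption that the two masks are complementary, i.e., \(V_1 = 1 - V_0\).

```python
def blend(i0_t, i1_t, v0):
    """Combine the two warped candidates as in (3), assuming V1 = 1 - V0 so that
    the two masks sum to one at every pixel."""
    return v0 * i0_t + (1.0 - v0) * i1_t
```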

3.4 Training strategy

We implement STDC-Net with the PyTorch toolkit. The batch size is set to 8. We adopt the Adam optimizer [20] with \(\beta _1 = 0.9\) and \(\beta _2 = 0.99\). The model is trained for 200 epochs with an initial learning rate of \(2 \times 10^{-4}\). We use the L1 loss to train STDC-Net. The loss function is expressed as follows,

$$\begin{aligned} \begin{aligned} \mathcal {L}&= \Vert I_t - \hat{I}_t \Vert _1 \end{aligned} \end{aligned}$$
(4)

where \(I_t\) is the output of STDC-Net and \(\hat{I}_t\) is the ground truth.
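A minimal training loop matching these settings might look as follows; the dataset interface, the device handling, and the absence of a learning-rate schedule are assumptions of this sketch.

```python
import torch
from torch.utils.data import DataLoader

def train(model, dataset, epochs=200, lr=2e-4, batch_size=8, device="cuda"):
    """Train with Adam (beta1=0.9, beta2=0.99), batch size 8, 200 epochs, L1 loss.
    The dataset is assumed to yield (I0, I1, It_gt) triplets of image tensors.
    """
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True, num_workers=4)
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, betas=(0.9, 0.99))
    l1 = torch.nn.L1Loss()
    for epoch in range(epochs):
        for i0, i1, gt in loader:
            i0, i1, gt = i0.to(device), i1.to(device), gt.to(device)
            pred = model(i0, i1)            # interpolated intermediate frame
            loss = l1(pred, gt)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```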

4 Experiments

4.1 Experimental settings

Dataset

As our goal is to perform frame interpolation for video conferences, we choose commonly used facial datasets for training and testing. We use the Voxceleb2 [5] dataset to train STDC-Net. It is a large-scale speaker recognition dataset containing over 1 million utterances from 6,000 speakers. The frame rate of Voxceleb2 videos is 25 FPS and the resolution is 224 \(\times \) 224. We select 20 thousand one-second videos from it as the training set. For data augmentation, we randomly perform vertical and horizontal flipping, as well as reversing the sequence order.
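The augmentation described above can be sketched as follows; applying each operation with probability 0.5 is an assumption of this sketch.

```python
import random

def augment(i0, it, i1):
    """Random horizontal/vertical flips and temporal order reversal for an
    (I0, It, I1) triplet of (C, H, W) tensors."""
    if random.random() < 0.5:                       # horizontal flip
        i0, it, i1 = (x.flip(-1) for x in (i0, it, i1))
    if random.random() < 0.5:                       # vertical flip
        i0, it, i1 = (x.flip(-2) for x in (i0, it, i1))
    if random.random() < 0.5:                       # reverse the sequence order
        i0, i1 = i1, i0
    return i0, it, i1
```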

We assess the performance of the models on the Voxceleb2 [5] and HDTF [48] datasets. We select 3,000 triplet sequences from Voxceleb2 as a test set. HDTF is a large-scale video dataset designed for talking face generation tasks; its frame rate is 30 FPS, and the resolution of the original videos is 720P or 1080P. We use a landmark detector [18] to crop the face region, and the size of the final images is 512 \(\times \) 512. We select 2,993 triplet sequences from HDTF as a test set.

Evaluation Metrics

We use the commonly used reconstruction metrics Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index Measure (SSIM) [40], Multi-Scale Structural Similarity (MS-SSIM) [41], and Learned Perceptual Image Patch Similarity (LPIPS) [47] to evaluate the quality of the interpolated frames. Higher PSNR, SSIM, and MS-SSIM indicate better performance, while lower LPIPS indicates better performance.
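For reference, PSNR can be computed as below; SSIM, MS-SSIM, and LPIPS are computed with their standard implementations [40, 41, 47].

```python
import torch

def psnr(pred, target, max_val=1.0):
    """Peak Signal-to-Noise Ratio for images scaled to [0, max_val]."""
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)
```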

4.2 Comparison with state-of-the-art methods

We evaluate the proposed method against the following state-of-the-art video frame interpolation methods: SuperSloMo [15], CAIN [4], AdaCof [23] (the baseline of the proposed method), XVFI [38], VFIT [37], and FLAVR [17].

  • SuperSloMo [15] is a flow-based method that employs a linear combination of bidirectional optical flows and utilizes a Unet architecture to further refine them, resulting in the generation of the final intermediate frame.

  • CAIN [4] utilizes the PixelShuffle operator and incorporates channel attention to implicitly capture motion information in order to generate the intermediate frame.

  • AdaCof [23] is a kernel-based method that extracts spatial features using 2D CNNs and utilizes these features to predict offsets, weights, and masks for generating the intermediate frame.

  • XVFI [38] is based on a recursive multi-scale shared structure, combining both linear approximation and flow reversal techniques to obtain the final optical flows. It also employs a Unet architecture to predict masks for generating the intermediate frame.

  • VFIT [37] is a Transformer-based architecture that leverages attention mechanisms to extract spatial-temporal features. These features are then used to predict weights, offsets, and masks, which are utilized to generate the intermediate frame.

  • FLAVR [17] utilizes 3D convolutions to extract spatial-temporal features and employs 3D transpose convolutions for feature decoding, resulting in the generation of the intermediate frame.

Quantitative Results

We calculate the values of PSNR, SSIM, and MS-SSIM for each algorithm on Voxceleb2 and HDTF test sets. The results are shown in Table 1. According to the results presented in the table, it can be observed that the proposed STDC-Net outperforms all the compared methods on both Voxceleb2 and HDTF datasets. Notably, VFIT achieves the second-best performance on Voxceleb2 test set, while XVFI obtains the second-best performance on HDTF dataset. In comparison to VFIT, STDC-Net achieves PSNR improvements of 0.051dB on Voxceleb2 test set, while compared with XVFI, it achieves PSNR improvements of 0.098dB on HDTF test set. Furthermore, STDC-Net achieves PSNR improvements of 0.13dB and 0.17dB over the baseline AdaCof on the Voxceleb2 and HDTF test sets, respectively.

Table 1 Quantitative comparisons on the Voxceleb2 and HDTF datasets. The best result is bold, while the second best is underlined

We also measure LPIPS for each algorithm; the results are presented in Fig. 5. As can be observed, AdaCof and our proposed STDC-Net achieve the best performance on the Voxceleb2 and HDTF test sets, respectively. The quantitative results obtained on both Voxceleb2 and HDTF test sets demonstrate the effectiveness of our proposed STDC-Net for facial video frame interpolation.

Qualitative Results

We present qualitative comparisons on the Voxceleb2 and HDTF datasets in Figs. 6 and 7. As shown in these figures, when large and nonlinear motion as well as occlusion occurs, the proposed method generates visually more pleasing results with clearer structures and fewer distortions than the compared approaches. We also present qualitative results for scenarios in which large motion and occlusion occur simultaneously in Fig. 8. In Fig. 8, the person is blinking while speaking, leading to substantial motion and occlusion in the eye and mouth regions; the results demonstrate that our method achieves favorable performance. To further evaluate the accuracy of our interpolation results, we present error maps of the interpolated frames in Fig. 9, where colors closer to red indicate larger errors and colors closer to blue indicate smaller errors. As shown in Fig. 9, when the man moves his head, the proposed method and SuperSloMo [15] achieve the best and second-best performance, respectively. Moreover, SuperSloMo [15] predicts inaccurate optical flow that leads to errors at the neck boundary, while the proposed method produces fewer errors there. These findings further demonstrate the effectiveness of the proposed method for video frame interpolation.

We also visualize the offsets and masks of the proposed method and the baseline (AdaCof [23]). Specifically, we visualize the offsets by computing, for each pixel, the weighted sum of the backward and forward offset vectors, which we call MeanFlow. It can be written as follows,

$$\begin{aligned} \begin{aligned} F(x,y)&= \sum ^{N-1}_{k=0} \sum ^{N-1}_{l=0} W_{k,l}(x,y)(\alpha _{(k,l)}, \beta _{(k,l)}) \end{aligned} \end{aligned}$$
(5)
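Given the predicted weights and offsets, (5) can be computed as follows; the tensor layout is the same as in the Section 3.3 sketches.

```python
import torch

def mean_flow(weights, alpha, beta):
    """MeanFlow visualization of (5): per-pixel weighted sum of offset vectors.

    weights, alpha, beta: (B, N*N, H, W); returns a flow-like map (B, 2, H, W).
    """
    fx = (weights * alpha).sum(dim=1)   # weighted horizontal offsets
    fy = (weights * beta).sum(dim=1)    # weighted vertical offsets
    return torch.stack((fx, fy), dim=1)
```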

The visual results are presented in Fig. 10 and demonstrate the effectiveness of the proposed STDC-Net for facial video frame interpolation. The depicted scenario involves a subject moving the head while speaking, which poses a challenging case for accurate frame interpolation. The proposed method outperforms AdaCof [23] in predicting more accurate offsets and masks. These qualitative results not only corroborate the effectiveness of the proposed method but also demonstrate its potential in addressing complex and challenging video frame interpolation tasks.

4.3 Ablation study

To assess the effectiveness of our proposed method, we conduct ablation studies on Voxceleb2 and HDTF datasets by varying the network architecture.

Fig. 5 Comparison of Learned Perceptual Image Patch Similarity (LPIPS) among the compared approaches on the Voxceleb2 and HDTF test sets, where lower bars indicate better performance

Fig. 6 Visual comparisons with state-of-the-art methods on the Voxceleb2 dataset, where (a) and (b) are the input frames, (c) to (h) are the intermediate frames generated by SuperSloMo [15], CAIN [4], AdaCof [23], XVFI [38], VFIT [37], and FLAVR [17], respectively, (i) is the intermediate frame generated by our method, and (j) is the Ground Truth

Fig. 7 Visual comparisons with state-of-the-art methods on the HDTF dataset, where the best result is in bold and the second best is underlined, (a) and (b) are the input frames, (c) to (h) are the intermediate frames generated by SuperSloMo [15], CAIN [4], AdaCof [23], XVFI [38], VFIT [37], and FLAVR [17], respectively, (i) is the intermediate frame generated by our method, and (j) is the Ground Truth

Fig. 8 Qualitative evaluation under large motion and occlusion, where the best result is in bold and the second best is underlined, (a) and (b) are the input frames, (c) to (h) are the intermediate frames generated by SuperSloMo [15], CAIN [4], AdaCof [23], XVFI [38], VFIT [37], and FLAVR [17], respectively, (i) is the intermediate frame generated by our method, and (j) is the Ground Truth

Fig. 9 Error map comparisons with state-of-the-art methods on the HDTF dataset, where (a) is the overlapped result of the input frames, and (b) to (h) are the intermediate frames generated by SuperSloMo [15], CAIN [4], AdaCof [23], XVFI [38], VFIT [37], FLAVR [17], and our method, respectively

Fig. 10 Visualizations of the network outputs compared with those of AdaCof [23]. In the second to fourth columns, the first and third rows display the visualization results of AdaCof, while the second and fourth rows display those of the proposed method, where (a) is the overlapped result of the input frames, (b) and (c) are the bi-directional optical flows, and (d) and (e) are the masks that weight the intermediate frames generated from \(I_0\) and \(I_1\)

To validate the efficacy of spatial-temporal features in video frame interpolation, we trained a model (first row of Table 2, Fig. 11 (e) Model V1 and Fig. 12 (e) Model V1) that exclusively extracts spatial features. To achieve this, we replaced the 3D convolutions in the embedding layer and downsample layers with 2D convolutions, and removed the TFE blocks from the STFE blocks. The experimental results in Table 2 indicate that the proposed STDC-Net outperforms Model V1 in terms of both PSNR and SSIM. In particular, the proposed STDC-Net improves PSNR by 0.16dB and 0.26dB on the Voxceleb2 and HDTF test sets, respectively.

To validate the effectiveness of combining CNNs and MLPs in video frame interpolation, we trained another model (second row of Table 2, Fig. 11 (f) Model V2 and Fig. 12 (f) Model V2), which utilizes ResNet-3D (R3D) [39] to extract multi-scale deep spatial-temporal features. The experimental results in Table 2 show that our proposed STDC-Net outperforms Model V2 in terms of PSNR and SSIM, improving PSNR by 0.06dB and 0.07dB on the Voxceleb2 and HDTF test sets, respectively.

Table 2 The result of ablation study on Voxceleb2 and HDTF datasets. The best result is bold, while the second best is underlined
Fig. 11 Visual comparison of ablation studies with head motion and small mouth motion, where (a) is the overlapped result of the input frames, (b) and (c) are the input frames, (d) is the Ground Truth, and (e) to (g) are the intermediate frames generated by Model V1, Model V2, and the proposed method, respectively

Fig. 12 Visual comparison of ablation studies with large head motion and large mouth motion, where (a) is the overlapped result of the input frames, (b) and (c) are the input frames, (d) is the Ground Truth, and (e) to (g) are the intermediate frames generated by Model V1, Model V2, and the proposed method, respectively

We conducted additional visual comparisons on the HDTF dataset, as presented in Figs. 11 and 12. In both figures, the two individuals are nodding and speaking simultaneously. From the visual comparisons, it can be observed that the intermediate frame generated by Model V1 exhibits the worst quality, with severe distortions in the eye and mouth regions. The intermediate frame generated by Model V2 is better in the eye region, but the results in the mouth region are still unsatisfactory. In contrast, our proposed method generates intermediate frames with less distortion and clearer contours in the mouth and eye regions, which are closer to the ground truth. These experimental results demonstrate the effectiveness of extracting spatial-temporal features and of combining CNNs with MLPs in video frame interpolation.

5 Conclusion

In this paper, we have proposed a Spatial-Temporal Deformable Convolution Network (STDC-Net) for conference video frame interpolation to address dropped frames and reduced frame rates in video conference communication, improving video quality and creating a more immersive and engaging experience for all participants. In particular, STDC-Net extracts spatial-temporal features by splitting feature maps along horizontal, vertical, and temporal pathways and processing them with different MLPs at multiple scales. Based on the spatial-temporal features, the model generates the intermediate frame by predicting weights, offsets, and masks. By setting different split lengths at different scales, the proposed STDC-Net can extract both local and global spatial features. This enables the model to handle challenging scenarios such as moving subjects, facial expressions, and background changes, which are common in real-world video conference communication. The experiments have shown that the proposed model outperforms state-of-the-art methods; compared to the baseline, PSNR improves by 0.13dB and 0.17dB on the Voxceleb2 and HDTF datasets, respectively.