1 Introduction

In recent years, video conferencing has become an essential tool for communication and collaboration across industries and settings. It enables people to connect and work together seamlessly regardless of location, and it contributes to increased productivity, reduced travel costs, and improved work-life balance. However, when a video conference is conducted over a low-bandwidth or high-latency network, the video stream may suffer dropped frames or reduced frame rates, resulting in lower-quality video that seriously degrades the user experience. Video frame interpolation is often used to address this issue: it generates additional frames to fill the gaps between the original frames, producing smoother and more fluid motion. In addition, video conferences are often recorded for later playback or archival purposes; by interpolating one or more frames between every two consecutive frames of the original video, frame interpolation improves smoothness and creates a more seamless playback experience.

Deep learning has enabled the development of various video frame interpolation algorithms, which can be classified into flow-based [1, 9, 12, 13, 15, 22, 29, 34, 38, 43, 45] and kernel-based [3, 6, 23, 31, 32, 36, 37] methods. Flow-based methods estimate the bi-directional optical flow between the input and target frames to warp the input frames, and therefore depend heavily on the accuracy of the estimated flow. For instance, Xu et al. [45] used the optical flow between four frames to obtain quadratic motion parameters through an analytical solution, improving the accuracy of optical flow prediction. However, when there are sudden changes in motion, such as abrupt jerks, the optical flow predicted under the quadratic motion assumption may be inaccurate, resulting in suboptimal interpolated intermediate frames. To address this issue, Dutta et al. [9] used 3D Convolutional Neural Networks (3DCNN) to estimate the motion parameters, further improving the accuracy of the optical flow. However, during video conferences the oral region of participants is often occluded while speaking, and facial and head movements are sometimes large and nonlinear. This makes it difficult to predict optical flow accurately, which degrades the quality of the interpolated intermediate frames. In contrast, kernel-based methods learn convolution kernels to synthesize the intermediate frame. The interpolation accuracy depends on the kernel size, but a larger kernel requires more computation. To solve this issue, [23] predicted offsets and used deformable convolution to enlarge the range of convolution, which significantly improved performance over conventional video frame interpolation approaches. However, video frames differ from single images in that they contain temporal features, i.e., the motion or changes that occur over time, while the feature extraction module in [23] focuses primarily on spatial features. In addition, that feature extraction module consists of Convolutional Neural Networks (CNNs), whose limited receptive field can only capture local features within a certain range of pixels. When a pixel undergoes significant motion, such as a person shaking his head during a video conference, the limitation of CNNs in capturing global information results in poor performance for conference video frame interpolation.

To solve the aforementioned issues, we propose a Spatial-Temporal Deformable Convolution Network (STDC-Net) for video frame interpolation in video conferences. Specifically, STDC-Net first uses an embedding layer to generate shallow spatial-temporal features of the input frames. Then the Spatial-Temporal Feature Learning (STFL) module generates multi-scale deep spatial-temporal features of the input frames. The STFL module is an encoder-decoder architecture: the encoder has four stages, each containing several Spatial-Temporal Feature Extracting (STFE) blocks and a downsample layer. The STFE block is composed of Multi-Layer Perceptrons (MLPs), feedforward neural networks consisting of multiple layers of interconnected neurons; each neuron computes a weighted sum of its inputs, applies an activation function, and passes the result to the next layer, so that successive layers learn increasingly complex representations of the input features. The STFE block splits the spatial-temporal features along horizontal, vertical, and temporal pathways and sends the split features into three different MLPs to learn spatial and temporal features. By splitting the feature maps into chunks of different lengths along the horizontal and vertical pathways at different stages, we can obtain both local and global spatial features. The decoder is realized by 3D deconvolution and 3D convolution layers that upsample the features. Besides, skip connections are used between the encoder and decoder to capture more detailed features. Finally, the Frame Synthesis (FS) module predicts adaptive kernels and offsets to sample pixels in each frame and generate two intermediate frames. In addition, occlusion masks are predicted to weight the two intermediate frames when generating the final intermediate frame. We conduct extensive experiments on the Voxceleb2 [5] and HDTF [48] datasets, and the results demonstrate that our method outperforms state-of-the-art methods.

The main contributions of our work are summarized as follows:

  • We have proposed a novel video frame interpolation algorithm specifically designed for video conferencing, effectively addressing the issues of frame drops and low video frame rates caused by network or bandwidth limitations.

  • We have split features into chunks of different lengths along the horizontal, vertical, and temporal directions at various scales. These split features are then fed into a network that integrates 3DCNN and MLP to extract local and global features, improving offset prediction accuracy and effectively handling large motions.

  • We have conducted a series of comprehensive experiments on Voxceleb2 and HDTF datasets. The results indicate that the proposed method outperforms state-of-the-art methods, achieving the best performance.

The structure of this paper is outlined as follows: Related work is reviewed in Section 2. The proposed algorithm is described in Section 3. The experimental results are discussed in Section 4. Section 5 concludes the paper.

2 Related work

Video frame interpolation aims at interpolating one or more intermediate frames from two consecutive frames while maintaining spatial and temporal consistencies. Existing methods can be mainly classified into two categories: flow-based [10, 11, 14, 16, 21, 25, 26, 27, 28, 30, 44] and kernel-based [2, 7, 8, 19, 24, 33, 42] methods.

The flow-based methods first predict the optical flow between the input frames and the intermediate frame, then forward-warp or backward-warp the input frames or features with this flow to synthesize the intermediate frame. Forward-warping methods use the optical flows \(F_{0 \rightarrow t}\) and \(F_{1 \rightarrow t}\) to warp \(I_0\) and \(I_1\). To improve the accuracy of the estimated optical flow, Bao et al. [1] weighted multi-mapped pixels based on depth information, so that foreground pixels with smaller depth values receive higher weights. For hole positions that no flow vector passes through, they computed the flow by averaging the four neighboring positions with available flows. However, depth estimation is itself a difficult task, and its accuracy greatly impacts the final optical flow. In contrast, Niklaus et al. [30] proposed to allocate weights with a softmax by adding an importance mask Z to obtain more accurate optical flow; however, [30] may not work well for scenes with large motion. Unlike forward warping, backward warping warps the input frames with the optical flows \(F_{t \rightarrow 0}\) and \(F_{t \rightarrow 1}\). For example, [15] linearly combined \(F_{0 \rightarrow 1}\) and \(F_{1 \rightarrow 0}\) to approximate \(F_{t \rightarrow 0}\) and \(F_{t \rightarrow 1}\), which causes errors at motion boundaries. Park et al. [34] assumed that the optical flows \(F_{t \rightarrow 0}\) and \(F_{t \rightarrow 1}\) are symmetric and estimated them directly from a bilateral cost volume; nevertheless, this assumption does not hold in the real world. To solve this issue, Park et al. [35] first estimated the symmetric optical flows \(F_{t \rightarrow 0}\) and \(F_{t \rightarrow 1}\) and then adjusted them to be asymmetric, improving the accuracy of the predicted optical flow.
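For reference, the backward-warping operation used by these methods is commonly implemented with bilinear sampling. The minimal PyTorch sketch below illustrates it; the tensor layout, border padding mode, and flow convention are assumptions of this sketch rather than details taken from any cited method.

```python
import torch
import torch.nn.functional as F

def backward_warp(frame, flow):
    """Backward-warp `frame` (B, C, H, W) with a flow field `flow` (B, 2, H, W).

    flow[:, 0] holds horizontal and flow[:, 1] vertical displacements in pixels,
    i.e. a flow F_{t->i} pointing from the target frame to the source frame.
    """
    b, _, h, w = frame.shape
    # Base sampling grid of pixel coordinates.
    ys, xs = torch.meshgrid(torch.arange(h, device=frame.device),
                            torch.arange(w, device=frame.device), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0)   # (1, 2, H, W)
    coords = grid + flow                                       # displaced coordinates
    # Normalize to [-1, 1] as expected by grid_sample (x first, then y).
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    sample_grid = torch.stack((coords_x, coords_y), dim=-1)    # (B, H, W, 2)
    return F.grid_sample(frame, sample_grid, mode="bilinear",
                         padding_mode="border", align_corners=True)
```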

In video conferences, occlusions and large and nonlinear motions often occur, which makes it challenging to predict accurate optical flow. Since flow-based approaches heavily rely on the accuracy of bi-directional optical flows, these challenging conditions can result in inaccurate optical flow and severely affect the quality of the interpolated frame.

Compared with flow-based approaches, kernel-based methods unify motion perception and target frame generation into a convolution operation. For example, Niklaus et al. [31] proposed a fully convolutional neural network that estimates a 2D spatially-adaptive convolution kernel for each pixel, capturing both the local motion between the input frames and the coefficients used for pixel synthesis. However, this approach is computationally expensive, since estimating large kernels for every pixel results in high memory usage. To reduce the memory demand, Niklaus et al. [32] proposed to separate each 2D convolution kernel into two 1D kernels, but this still cannot handle motions larger than the kernel size. To solve this problem, Lee et al. [23] proposed a video frame interpolation method that extracts spatial features with 2D CNNs to predict weights and offsets, which are used by deformable convolution to enlarge the range of convolution. Nevertheless, CNNs have limitations in extracting global spatial features, and the input video frames also contain temporal features; both factors may degrade the accuracy of the predicted weights and offsets, so [23] cannot interpolate high-quality intermediate frames for conference video. In this paper, we propose a Spatial-Temporal Deformable Convolution Network (STDC-Net) for conference video, which combines CNN with MLP to extract multi-scale spatial-temporal features. The spatial-temporal features are split into chunks of different lengths at various scales to extract local and global spatial features. The multi-scale spatial-temporal features are then used to predict more accurate weights and offsets for generating high-quality intermediate frames. In summary, our method differs from traditional kernel-based video frame interpolation approaches in two aspects: (1) we consider both spatial and temporal features of the input frames; (2) we integrate CNN and MLP to expand the receptive field and extract local-global spatial features, effectively handling large motions.

Fig. 1 The framework of the proposed Spatial-Temporal Deformable Convolution Network (STDC-Net) for video frame interpolation. It first extracts shallow spatial-temporal features with the Embedding Layer. Then, the Spatial-Temporal Feature Learning (STFL) module extracts multi-scale deep spatial-temporal features with Spatial-Temporal Feature Extracting (STFE) blocks. Finally, the Frame Synthesis (FS) module predicts weights, offsets, and masks to generate the intermediate frame

3 Proposed method

The goal of video frame interpolation is to generate an intermediate frame \({I}_t\) at temporal location \(t\in (0,1)\) in between two given consecutive video frames \(I_0\) and \(I_1\). The framework of the proposed method is shown in Fig. 1.

3.1 Framework overview

As shown in Fig. 1, the proposed STDC-Net contains three modules: the embedding layer, the Spatial-Temporal Feature Learning module, and the Frame Synthesis module. First, the embedding layer takes \(I_0\) and \(I_1\) as input and outputs shallow spatial-temporal features; it is realized by a 3D convolution, Batch Normalization (BN), and a ReLU activation. Second, the Spatial-Temporal Feature Learning (STFL) module (Section 3.2) takes the shallow spatial-temporal features as input and outputs multi-scale deep spatial-temporal features. Finally, the Frame Synthesis (FS) module (Section 3.3) uses the extracted spatial-temporal features to predict weights, offsets, and masks for generating the intermediate frame \(I_t\) through deformable convolution.
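For concreteness, a minimal sketch of the embedding layer as described (a 3D convolution followed by Batch Normalization and ReLU) is given below; the kernel size and the 64 output channels are assumptions, as they are not specified here.

```python
import torch.nn as nn

class EmbeddingLayer(nn.Module):
    """Shallow spatial-temporal embedding: Conv3d -> BatchNorm3d -> ReLU.

    The 3x3x3 kernel and the 64 output channels are illustrative assumptions.
    """
    def __init__(self, in_channels=3, out_channels=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm3d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, frames):
        # frames: (B, C, T, H, W) with T = 2 input frames I0 and I1.
        return self.body(frames)
```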

3.2 Spatial-temporal feature learning

Recently, Zhang et al. [46] proposed an MLP-based model for video classification that achieves strong performance. Following this strategy, we use an MLP-based encoder-decoder architecture to learn multi-scale deep spatial-temporal features. As shown in Fig. 1, the encoder is composed of four stages; each stage consists of several Spatial-Temporal Feature Extracting (STFE) blocks and a 3D convolution layer with a stride of 2 that downsamples the input features. Specifically, the four stages contain 4, 3, 8, and 3 STFE blocks, and their output channels are 64, 128, 256, and 512, respectively. The decoder has four layers: the first three are realized by 3D deconvolutions with a stride of 2 that upsample the feature maps, and the last layer is a 3D convolution. Furthermore, to enhance the model's robustness, skip connections implemented by concatenation are added between the encoder and decoder to capture detailed and textured features. Next, we describe the STFE block in more detail.
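The following PyTorch skeleton mirrors the stage layout described above (4, 3, 8, and 3 STFE blocks with output channels 64, 128, 256, and 512; stride-2 3D convolutions for downsampling; three 3D deconvolutions plus a final 3D convolution with concatenation-based skip connections). The kernel sizes, the placement of the downsampling layer at the start of each stage, and spatial-only downsampling are assumptions of this sketch; `stfe_block` stands for the STFE block described next.

```python
import torch
import torch.nn as nn

class STFL(nn.Module):
    """Skeleton of the Spatial-Temporal Feature Learning module (a sketch)."""

    def __init__(self, stfe_block, in_ch=64,
                 depths=(4, 3, 8, 3), channels=(64, 128, 256, 512)):
        super().__init__()
        self.stages = nn.ModuleList()
        prev = in_ch
        for depth, ch in zip(depths, channels):
            self.stages.append(nn.Sequential(
                # Stride-2 downsampling in the spatial dimensions only (assumption).
                nn.Conv3d(prev, ch, kernel_size=3, stride=(1, 2, 2), padding=1),
                *[stfe_block(ch) for _ in range(depth)]))
            prev = ch

        # Decoder: three stride-2 3D deconvolutions and a final 3D convolution.
        def up(cin, cout):
            return nn.ConvTranspose3d(cin, cout, kernel_size=(3, 4, 4),
                                      stride=(1, 2, 2), padding=1)
        self.up3 = up(channels[3], channels[2])
        self.up2 = up(channels[2] * 2, channels[1])
        self.up1 = up(channels[1] * 2, channels[0])
        self.last = nn.Conv3d(channels[0] * 2, channels[0], kernel_size=3, padding=1)

    def forward(self, x):                     # x: (B, 64, T, H, W) from the embedding layer
        skips = []
        for stage in self.stages:
            x = stage(x)
            skips.append(x)                   # keep each stage output for skip connections
        d3 = self.up3(skips[3])
        d2 = self.up2(torch.cat([d3, skips[2]], dim=1))
        d1 = self.up1(torch.cat([d2, skips[1]], dim=1))
        return self.last(torch.cat([d1, skips[0]], dim=1))
```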

Fig. 2 The process of extracting spatial features. The feature maps are first split along the horizontal and vertical pathways, respectively, then sent into MLPs and reshaped back to their original shape. In addition, to reduce computation cost, the feature maps are further split along the channel dimension

The STFE block consists of a Spatial Feature Extracting (SFE) block and a Temporal Feature Extracting (TFE) block, which extract spatial and temporal features, respectively. The SFE block processes the spatial-temporal features of each frame separately along both the horizontal and vertical pathways. As shown in Fig. 2, we first split the input features \(F\in \mathbb {R}^{H \times W \times C}\) (H, W and C are the height, width, and channel number of the feature maps) along the horizontal pathway, where the length of each chunk is L. Following [46], we further split each chunk along the channel dimension to reduce computation cost, so that each chunk has D channels. We thus obtain \(\frac{H \times W \times C}{D \times L}\) chunks \(F_i\in \mathbb {R}^{L \times D}\), send each chunk into an MLP to transform it, and finally reshape all chunks back to the original dimension \(F^h\in \mathbb {R}^{H \times W \times C}\).

We also process the original spatial-temporal features F along the vertical pathway. Similarly, we split F along the vertical direction and further split each chunk along the channel dimension, obtaining \(\frac{H \times W \times C}{D\times L}\) chunks \(F_j\in \mathbb {R}^{L \times D}\). We send them to another MLP and reshape them back to the original dimension \(F^v\in \mathbb {R}^{H \times W \times C}\). Finally, to capture the correlations between channels, we feed the original spatial-temporal features F into an MLP and obtain the feature \(F^c\in \mathbb {R}^{H \times W \times C}\). After obtaining the horizontal features \(F^h\), vertical features \(F^v\), and channel features \(F^c\), we sum them element-wise. The resulting features are then fed into an MLP to obtain weights \(W^h\), \(W^v\), and \(W^c\), which are used to compute the final spatial features \(F^s\).
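A minimal sketch of the SFE block under these definitions is shown below. Flattening each L × D chunk before a fully connected layer and normalizing the branch weights with a softmax are assumptions of this sketch, and H, W, and C are assumed to be divisible by L and D.

```python
import torch
import torch.nn as nn

class SFEBlock(nn.Module):
    """Sketch of the Spatial Feature Extracting block.

    Per-frame features (B, C, H, W) are split into length-L chunks along the
    horizontal and vertical directions and into groups of D channels; each chunk
    is flattened, transformed by a fully connected layer, and the horizontal,
    vertical, and channel branches are fused with learned weights.
    """

    def __init__(self, channels, chunk_len, chunk_dim):
        super().__init__()
        self.L, self.D = chunk_len, chunk_dim
        self.fc_h = nn.Linear(chunk_len * chunk_dim, chunk_len * chunk_dim)
        self.fc_v = nn.Linear(chunk_len * chunk_dim, chunk_len * chunk_dim)
        self.fc_c = nn.Linear(channels, channels)
        self.fc_w = nn.Linear(channels, 3 * channels)   # branch-weight predictor

    def _split_mlp(self, x, fc):
        # x: (B, H, W, C); split the last spatial axis into chunks of length L,
        # the channels into groups of D, flatten each chunk, and transform it.
        b, h, w, c = x.shape
        x = x.reshape(b, h, w // self.L, self.L, c // self.D, self.D)
        x = x.permute(0, 1, 2, 4, 3, 5).reshape(b, h, w // self.L, c // self.D,
                                                self.L * self.D)
        x = fc(x)
        x = x.reshape(b, h, w // self.L, c // self.D, self.L, self.D)
        return x.permute(0, 1, 2, 4, 3, 5).reshape(b, h, w, c)

    def forward(self, x):                       # x: (B, C, H, W)
        x = x.permute(0, 2, 3, 1)               # (B, H, W, C)
        f_h = self._split_mlp(x, self.fc_h)                                   # horizontal
        f_v = self._split_mlp(x.transpose(1, 2), self.fc_v).transpose(1, 2)   # vertical
        f_c = self.fc_c(x)                                                    # channel
        w = self.fc_w(f_h + f_v + f_c).reshape(*x.shape[:3], 3, -1).softmax(dim=3)
        out = w[..., 0, :] * f_h + w[..., 1, :] * f_v + w[..., 2, :] * f_c
        return out.permute(0, 3, 1, 2)          # back to (B, C, H, W)
```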

The encoder of STFL has four stages: in the earlier stages we set L to a smaller value to extract finer local spatial features, while in the later stages we set L to a larger value to extract global features. After the above processing, we obtain the local and global spatial features \(F_0^s\) and \(F_1^s\) of the input frames.

Fig. 3 The process of extracting temporal features. The feature maps are first split along the channel dimension to reduce computation cost, then split per pixel and concatenated along the temporal dimension (red arrow), and finally sent into an MLP and reshaped back to the original shape to obtain the temporal features

As shown in Fig. 3, in order to capture temporal information, the Temporal Feature Extracting (TFE) block concatenates the spatial features \(F_0^s\) and \(F_1^s\) along the temporal dimension as \(F^s\in \mathbb {R}^{H \times W \times T \times C}\). Then, we split \(F^s\) along the channel dimension to reduce computation cost, so that each chunk has D channels. After that, we concatenate features along the temporal dimension to obtain chunks \(F_i^s\in \mathbb {R}^{T \times D}\). Finally, we send these chunks into an MLP and reshape the output back to the original dimension to obtain temporal features \(F^t\in \mathbb {R}^{H \times W \times T \times C}\). Through the TFE and SFE blocks, we obtain the spatial-temporal features of the input frames.
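The TFE block can be sketched analogously; flattening each T × D chunk before a fully connected layer is, again, an assumption of this illustration.

```python
import torch.nn as nn

class TFEBlock(nn.Module):
    """Sketch of the Temporal Feature Extracting block.

    The stacked per-frame features (B, C, T, H, W) are split into groups of D
    channels; each pixel then contributes a (T, D) chunk that is flattened and
    transformed by a fully connected layer, mixing information across time.
    C is assumed to be divisible by D.
    """

    def __init__(self, chunk_dim, num_frames=2):
        super().__init__()
        self.D = chunk_dim
        self.fc_t = nn.Linear(num_frames * chunk_dim, num_frames * chunk_dim)

    def forward(self, x):                           # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        x = x.permute(0, 3, 4, 2, 1)                # (B, H, W, T, C)
        x = x.reshape(b, h, w, t, c // self.D, self.D)
        x = x.permute(0, 1, 2, 4, 3, 5).reshape(b, h, w, c // self.D, t * self.D)
        x = self.fc_t(x)                            # mix information along time
        x = x.reshape(b, h, w, c // self.D, t, self.D)
        x = x.permute(0, 1, 2, 4, 3, 5).reshape(b, h, w, t, c)
        return x.permute(0, 4, 3, 1, 2)             # back to (B, C, T, H, W)
```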

3.3 Frame synthesis

The Frame Synthesis (FS) module uses the spatial-temporal features extracted by the Spatial-Temporal Feature Learning (STFL) module to predict spatially-adaptive kernels for generating the intermediate frame.

Conventional convolution has been employed in video frame interpolation [31, 32]. As shown in Fig. 4, these methods obtain two intermediate frames from \(I_0\) and \(I_1\) by predicting a convolution kernel for each pixel, which can be formulated as,

$$\begin{aligned} \begin{aligned} I_{i\rightarrow t}(x, y)&= \sum ^{N-1}_{k=0} \sum ^{N-1}_{l=0} W_{i,k,l}(x,y)I_i(x+k, y+l) \end{aligned} \end{aligned}$$
(1)

where \(I_{i\rightarrow t}\) is the intermediate frame generated from \(I_i\), N is the kernel size, \(I_i\) is the i-th input frame, \(W_{i,k,l}\) are the kernel weights, and \(\left\{ (k,l)\right\} ^{N-1}_0 = \left\{ (-1,-1),(-1,0),...,(1,1)\right\} \).
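A naive reference implementation of (1) is shown below; storing the per-pixel weights in an N²-channel tensor and using replication padding at the borders are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def adaptive_kernel_synthesis(frame, weights, kernel_size=3):
    """Per-pixel adaptive-kernel synthesis as in (1).

    frame:   (B, C, H, W) input frame I_i
    weights: (B, N*N, H, W) per-pixel kernel weights W_{i,k,l}
    """
    n = kernel_size
    pad = n // 2
    padded = F.pad(frame, (pad, pad, pad, pad), mode="replicate")
    out = torch.zeros_like(frame)
    for k in range(n):          # kernel row index; effective offset is k - pad
        for l in range(n):      # kernel column index; effective offset is l - pad
            shifted = padded[:, :, k:k + frame.shape[2], l:l + frame.shape[3]]
            out = out + weights[:, k * n + l].unsqueeze(1) * shifted
    return out
```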

As the kernel shape is a rectangular grid, it is unable to handle motions larger than the kernel size. As shown in Fig. 4, AdaCof [23] addressed this issue with spatially-adaptive deformable convolution, which can be formulated as,

$$\begin{aligned} \begin{aligned} I_{i\rightarrow t}(x, y)&= \sum ^{N-1}_{k=0} \sum ^{N-1}_{l=0} W_{i,k,l}(x,y)I_i(x+k+\alpha _{k,l}, y+l+\beta _{k,l}) \end{aligned} \end{aligned}$$
(2)

where \(\left\{ \alpha _{k,l},\beta _{k,l}\right\} _0^{N-1}\) is a set of adaptable sampling offsets.
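The sketch below extends the previous one to (2) by shifting every kernel tap with the learned offsets and sampling bilinearly; setting the offsets to zero recovers (1). The offset convention (α horizontal, β vertical) is an assumption of this illustration.

```python
import torch
import torch.nn.functional as F

def deformable_kernel_synthesis(frame, weights, alpha, beta, kernel_size=3):
    """Deformable per-pixel kernel synthesis as in (2).

    frame:        (B, C, H, W)
    weights:      (B, N*N, H, W) per-pixel kernel weights
    alpha, beta:  (B, N*N, H, W) horizontal / vertical offsets in pixels
    """
    b, c, h, w = frame.shape
    n, pad = kernel_size, kernel_size // 2
    ys, xs = torch.meshgrid(torch.arange(h, device=frame.device),
                            torch.arange(w, device=frame.device), indexing="ij")
    out = torch.zeros_like(frame)
    for k in range(n):
        for l in range(n):
            idx = k * n + l
            # Sampling position for this tap: base grid + tap offset + learned offset.
            sx = xs + (l - pad) + alpha[:, idx]
            sy = ys + (k - pad) + beta[:, idx]
            grid = torch.stack((2 * sx / (w - 1) - 1, 2 * sy / (h - 1) - 1), dim=-1)
            sampled = F.grid_sample(frame, grid, mode="bilinear",
                                    padding_mode="border", align_corners=True)
            out = out + weights[:, idx].unsqueeze(1) * sampled
    return out
```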

Fig. 4 The difference between using conventional convolution (a) and using deformable convolution (b) in video frame interpolation

We follow the strategy of AdaCof. As shown in Fig. 1, we first use a 3D convolution layer to further process the spatial-temporal features obtained from the last decoder layer. We reshape the spatial-temporal feature maps as \(F \in \mathbb {R}^{BT \times C \times H \times W}\), where B is the batch size, T is the number of input frames, C is the number of channels, and H and W are the height and width of the feature maps. To expand the convolution field, we predict offsets \((\alpha , \beta )\). Specifically, the offsets learn small displacements for each pixel in the spatial domain, which adaptively adjust the sampling positions of the convolution kernel on the input frames. By learning the offsets, deformable convolution can capture deformation information in the input frames, allowing the convolution operation to better adapt to spatial variations in the target. As shown in Fig. 1, to obtain \(\alpha \) we process the reshaped feature maps with two 2D convolution layers and then apply a 2D transposed convolution so that the resulting maps match the size of the input frames. The process of obtaining \(\beta \) is the same as that of obtaining \(\alpha \). By applying the offsets, we determine the sampled pixels required for synthesizing each target pixel. Next, we process the spatial-temporal features with a network that has the same architecture as the offset estimator and normalize the output with a softmax activation to obtain the weights W. We reshape the predicted weights and offsets as \(F \in \mathbb {R}^{B \times T \times C \times H \times W}\) and split them along the temporal dimension. With the split weights and offsets, we obtain two intermediate frames from the input frames through formula (2).
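A sketch of one prediction branch under this description is given below; the intermediate channel width, kernel sizes, upsampling factor, and the sigmoid on the mask branch are assumptions of the sketch.

```python
import torch
import torch.nn as nn

class PredictionHead(nn.Module):
    """One FS prediction branch: two 2D convolutions followed by a 2D transposed
    convolution that brings the map back to the input-frame resolution.
    """
    def __init__(self, in_ch, out_ch, hidden=64, upscale=2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, out_ch, 3, padding=1),
            nn.ConvTranspose2d(out_ch, out_ch, upscale * 2, stride=upscale,
                               padding=upscale // 2),
        )

    def forward(self, x):
        return self.body(x)

# Usage sketch: feat has shape (B*T, C, H', W'); N is the deformable kernel size.
# alpha_head  = PredictionHead(C, N * N)
# beta_head   = PredictionHead(C, N * N)
# weight_head = PredictionHead(C, N * N)
# mask_head   = PredictionHead(C, 1)
# alpha, beta = alpha_head(feat), beta_head(feat)
# weights = torch.softmax(weight_head(feat), dim=1)   # normalize kernel weights per pixel
# mask = torch.sigmoid(mask_head(feat))               # occlusion mask in [0, 1] (assumption)
```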

Due to occlusions during motion, certain pixels in the generated intermediate frames need to be masked. To implement this, we further predict masks using the same approach as for the weights and split them along the temporal dimension; the final intermediate frame is then obtained from \(I_{0\rightarrow t}\) and \(I_{1\rightarrow t}\) with the masks. It can be written as follows,

$$\begin{aligned} \begin{aligned} I_t&= V_0 \otimes I_{0\rightarrow t} + V_1 \otimes I_{1\rightarrow t} \end{aligned} \end{aligned}$$
(3)

where \(\otimes \) is pixel-wise multiplication. For a target pixel (x, y), \(V_i = 1\) implies that the pixel is visible only in \(I_i\), and \(V_i = 0\) means it is visible only in the other frame.
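A minimal sketch of (3) is given below, under the additional assumption that the two masks are complementary, i.e., \(V_1 = 1 - V_0\).

```python
def blend(i0_t, i1_t, v0):
    """Combine the two warped candidates as in (3), assuming V1 = 1 - V0 so that
    the two masks sum to one at every pixel."""
    return v0 * i0_t + (1.0 - v0) * i1_t
```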

3.4 Training strategy

We implement STDC-Net with the PyTorch toolkit. The batch size is set to 8. We adopt the Adam optimizer [20] with \(\beta _1 = 0.9\) and \(\beta _2 = 0.99\). The model is trained for 200 epochs with an initial learning rate of \(2 \times 10^{-4}\). We use the L1 loss to train STDC-Net. The loss function is expressed as follows,

$$\begin{aligned} \begin{aligned} \mathcal {L}&= \Vert I_t - \hat{I}_t \Vert _1 \end{aligned} \end{aligned}$$
(4)

where \(I_t\) is the output of STDC-Net and \(\hat{I}_t\) is the ground truth.
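A minimal training loop matching these settings might look as follows; the dataset interface, the device handling, and the absence of a learning-rate schedule are assumptions of this sketch.

```python
import torch
from torch.utils.data import DataLoader

def train(model, dataset, epochs=200, lr=2e-4, batch_size=8, device="cuda"):
    """Train with Adam (beta1=0.9, beta2=0.99), batch size 8, 200 epochs, L1 loss.
    The dataset is assumed to yield (I0, I1, It_gt) triplets of image tensors.
    """
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True, num_workers=4)
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, betas=(0.9, 0.99))
    l1 = torch.nn.L1Loss()
    for epoch in range(epochs):
        for i0, i1, gt in loader:
            i0, i1, gt = i0.to(device), i1.to(device), gt.to(device)
            pred = model(i0, i1)            # interpolated intermediate frame
            loss = l1(pred, gt)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```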

4 Experiments

4.1 Experimental settings

Dataset

As our goal is to perform frame interpolation for video conferences, we choose commonly used facial datasets for training and testing. We use the Voxceleb2 [5] dataset to train STDC-Net. It is a large-scale speaker recognition dataset containing over 1 million utterances from 6,000 speakers. The frame rate of Voxceleb2 videos is 25 FPS and the resolution is 224 \(\times \) 224. We select 20 thousand one-second videos from it as the training set. For data augmentation, we randomly perform vertical and horizontal flipping, as well as reversing the sequence order.
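The augmentation described above can be sketched as follows; applying each operation with probability 0.5 is an assumption of this sketch.

```python
import random

def augment(i0, it, i1):
    """Random horizontal/vertical flips and temporal order reversal for an
    (I0, It, I1) triplet of (C, H, W) tensors."""
    if random.random() < 0.5:                       # horizontal flip
        i0, it, i1 = (x.flip(-1) for x in (i0, it, i1))
    if random.random() < 0.5:                       # vertical flip
        i0, it, i1 = (x.flip(-2) for x in (i0, it, i1))
    if random.random() < 0.5:                       # reverse the sequence order
        i0, i1 = i1, i0
    return i0, it, i1
```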

We assess the performance of the models on the Voxceleb2 [5] and HDTF [48] datasets. We select 3,000 triplet sequences from Voxceleb2 as a test set. HDTF is a large-scale video dataset designed for talking face generation tasks; its frame rate is 30 FPS, and the resolution of the original videos is 720P or 1080P. We use a landmark detector [18] to crop the face region, and the size of the final images is 512 \(\times \) 512. We select 2,993 triplet sequences from HDTF as a test set.

Evaluation Metrics

We use the commonly used reconstruction metrics Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index Measure (SSIM) [40], Multi-Scale Structural Similarity (MS-SSIM) [41], and Learned Perceptual Image Patch Similarity (LPIPS) [47] to evaluate the quality of the interpolated frames. Higher PSNR, SSIM, and MS-SSIM indicate better performance, while lower LPIPS indicates better performance.
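For reference, PSNR can be computed as below; SSIM, MS-SSIM, and LPIPS are computed with their standard implementations [40, 41, 47].

```python
import torch

def psnr(pred, target, max_val=1.0):
    """Peak Signal-to-Noise Ratio for images scaled to [0, max_val]."""
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)
```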

4.2 Comparison with state-of-the-art methods

We evaluate the proposed method against the following state-of-the-art video frame interpolation methods: SuperSloMo [15], CAIN [4], AdaCof [23] (the baseline of the proposed method), XVFI [38], VFIT [37], and FLAVR [17].

  • SuperSloMo [15] is a flow-based method that employs a linear combination of bidirectional optical flows and utilizes a Unet architecture to further refine them, resulting in the generation of the final intermediate frame.

  • CAIN [4] utilizes the PixelShuffle operator and incorporates channel attention to implicitly capture motion information in order to generate the intermediate frame.

  • AdaCof [23] is a kernel-based method that extracts spatial features using 2D CNNs and utilizes these features to predict offsets, weights, and masks for generating the intermediate frame.

  • XVFI [38] is based on a recursive multi-scale shared structure, combining both linear approximation and flow reversal techniques to obtain the final optical flows. It also employs a Unet architecture to predict masks for generating the intermediate frame.

  • VFIT [37] is a Transformer-based architecture that leverages attention mechanisms to extract spatial-temporal features. These features are then used to predict weights, offsets, and masks, which are utilized to generate the intermediate frame.

  • FLAVR [17] utilizes 3D convolutions to extract spatial-temporal features and employs 3D transpose convolutions for feature decoding, resulting in the generation of the intermediate frame.

Quantitative Results

We calculate the values of PSNR, SSIM, and MS-SSIM for each algorithm on Voxceleb2 and HDTF test sets. The results are shown in Table 1. According to the results presented in the table, it can be observed that the proposed STDC-Net outperforms all the compared methods on both Voxceleb2 and HDTF datasets. Notably, VFIT achieves the second-best performance on Voxceleb2 test set, while XVFI obtains the second-best performance on HDTF dataset. In comparison to VFIT, STDC-Net achieves PSNR improvements of 0.051dB on Voxceleb2 test set, while compared with XVFI, it achieves PSNR improvements of 0.098dB on HDTF test set. Furthermore, STDC-Net achieves PSNR improvements of 0.13dB and 0.17dB over the baseline AdaCof on the Voxceleb2 and HDTF test sets, respectively.

Table 1 Quantitative comparisons on the Voxceleb2 and HDTF datasets. The best result is bold, while the second best is underlined

We also measure LPIPS for each algorithm; the results are presented in Fig. 5. As can be observed, AdaCof and our proposed STDC-Net achieve the best performance on the Voxceleb2 and HDTF test sets, respectively. The quantitative results obtained on both Voxceleb2 and HDTF test sets demonstrate the effectiveness of our proposed STDC-Net for facial video frame interpolation.

Qualitative Results

We present qualitative comparisons on the Voxceleb2 and HDTF datasets in Figs. 6 and 7. As shown in these figures, when large and nonlinear motion as well as occlusion occurs, the proposed method generates visually more pleasing results with clearer structures and fewer distortions than the compared approaches. We also present qualitative results for scenarios in which large motion and occlusion occur simultaneously in Fig. 8. In Fig. 8, the person is blinking while speaking, leading to substantial motion and occlusion in the eye and mouth regions; the results demonstrate that our method achieves favorable performance. To further evaluate the accuracy of our interpolation results, we present error maps of the interpolated frames in Fig. 9, where colors closer to red indicate larger errors and colors closer to blue indicate smaller errors. As shown in Fig. 9, when the man moves his head, the proposed method and SuperSloMo [15] achieve the best and second-best performance, respectively. Moreover, SuperSloMo [15] predicts inaccurate optical flow that leads to errors at the neck boundary, while the proposed method produces fewer errors there. These findings further demonstrate the effectiveness of the proposed method for video frame interpolation.

We also visualize the offsets and masks of the proposed method and the baseline (AdaCof [23]). Specifically, we visualize the offsets by computing, for each pixel, the weighted sum of the backward and forward offset vectors, which we call MeanFlow. It can be written as follows,

$$\begin{aligned} \begin{aligned} F(x,y)&= \sum ^{N-1}_{k=0} \sum ^{N-1}_{l=0} W_{k,l}(x,y)(\alpha _{(k,l)}, \beta _{(k,l)}) \end{aligned} \end{aligned}$$
(5)
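Given the predicted weights and offsets, (5) can be computed as follows; the tensor layout is the same as in the Section 3.3 sketches.

```python
import torch

def mean_flow(weights, alpha, beta):
    """MeanFlow visualization of (5): per-pixel weighted sum of offset vectors.

    weights, alpha, beta: (B, N*N, H, W); returns a flow-like map (B, 2, H, W).
    """
    fx = (weights * alpha).sum(dim=1)   # weighted horizontal offsets
    fy = (weights * beta).sum(dim=1)    # weighted vertical offsets
    return torch.stack((fx, fy), dim=1)
```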

The visual results are presented in Fig. 10 and demonstrate the effectiveness of the proposed STDC-Net for facial video frame interpolation. The depicted scenario involves a subject moving the head while speaking, which poses a challenging case for accurate frame interpolation. The proposed method outperforms AdaCof [23] in predicting more accurate offsets and masks. These qualitative results not only corroborate the effectiveness of the proposed method but also demonstrate its potential in addressing complex and challenging video frame interpolation tasks.

4.3 Ablation study

To assess the effectiveness of our proposed method, we conduct ablation studies on Voxceleb2 and HDTF datasets by varying the network architecture.

Fig. 5 Comparison of Learned Perceptual Image Patch Similarity (LPIPS) among the compared approaches on the Voxceleb2 and HDTF test sets, where lower bars indicate better performance

Fig. 6 Visual comparisons with state-of-the-art methods on the Voxceleb2 dataset, where (a) and (b) are the input frames, (c) to (h) are the intermediate frames generated by SuperSloMo [15], CAIN [4], AdaCof [23], XVFI [38], VFIT [37], and FLAVR [17], respectively, (i) is the intermediate frame generated by our method, and (j) is the Ground Truth

Fig. 7 Visual comparisons with state-of-the-art methods on the HDTF dataset, where the best result is in bold and the second best is underlined, (a) and (b) are the input frames, (c) to (h) are the intermediate frames generated by SuperSloMo [15], CAIN [4], AdaCof [23], XVFI [38], VFIT [37], and FLAVR [17], respectively, (i) is the intermediate frame generated by our method, and (j) is the Ground Truth

Fig. 8 Qualitative evaluation under large motion and occlusion, where the best result is in bold and the second best is underlined, (a) and (b) are the input frames, (c) to (h) are the intermediate frames generated by SuperSloMo [15], CAIN [4], AdaCof [23], XVFI [38], VFIT [37], and FLAVR [17], respectively, (i) is the intermediate frame generated by our method, and (j) is the Ground Truth

Fig. 9 Error map comparisons with state-of-the-art methods on the HDTF dataset, where (a) is the overlapped result of the input frames, and (b) to (h) are the intermediate frames generated by SuperSloMo [15], CAIN [4], AdaCof [23], XVFI [38], VFIT [37], FLAVR [17], and our method, respectively

Fig. 10 Visualizations of the network outputs compared with those of AdaCof [23]. In the second to fourth columns, the first and third rows display the visualization results of AdaCof, while the second and fourth rows display those of the proposed method, where (a) is the overlapped result of the input frames, (b) and (c) are the bi-directional optical flows, and (d) and (e) are the masks that weight the intermediate frames generated from \(I_0\) and \(I_1\)

To validate the efficacy of spatial-temporal features in video frame interpolation, we trained a model (first row of Table 2, Fig. 11 (e) Model V1 and Fig. 12 (e) Model V1) that exclusively extracts spatial features. To achieve this, we replaced the 3D convolutions in the embedding layer and downsample layers with 2D convolutions, and removed the TFE blocks from the STFE blocks. The experimental results in Table 2 indicate that the proposed STDC-Net outperforms Model V1 in terms of both PSNR and SSIM. In particular, the proposed STDC-Net improves PSNR by 0.16dB and 0.26dB on the Voxceleb2 and HDTF test sets, respectively.

To validate the effectiveness of combining CNNs and MLPs in video frame interpolation, we trained another model (second row of Table 2, Fig. 11 (f) Model V2 and Fig. 12 (f) Model V2), which utilizes ResNet-3D (R3D) [39] to extract multi-scale deep spatial-temporal features. The experimental results in Table 2 show that our proposed STDC-Net outperforms Model V2 in terms of PSNR and SSIM, improving PSNR by 0.06dB and 0.07dB on the Voxceleb2 and HDTF test sets, respectively.

Table 2 The result of ablation study on Voxceleb2 and HDTF datasets. The best result is bold, while the second best is underlined
Fig. 11 Visual comparison of ablation studies with head motion and small mouth motion, where (a) is the overlapped result of the input frames, (b) and (c) are the input frames, (d) is the Ground Truth, and (e) to (g) are the intermediate frames generated by Model V1, Model V2, and the proposed method, respectively

Fig. 12 Visual comparison of ablation studies with large head motion and large mouth motion, where (a) is the overlapped result of the input frames, (b) and (c) are the input frames, (d) is the Ground Truth, and (e) to (g) are the intermediate frames generated by Model V1, Model V2, and the proposed method, respectively

We conducted additional visual comparisons on the HDTF dataset, as presented in Figs. 11 and 12. In both figures, the two individuals are nodding and speaking simultaneously. From the visual comparisons, it can be observed that the intermediate frame generated by Model V1 exhibits the worst quality, with severe distortions in the eye and mouth regions. The intermediate frame generated by Model V2 is better in the eye region, but the results in the mouth region are still unsatisfactory. In contrast, our proposed method generates intermediate frames with less distortion and clearer contours in the mouth and eye regions, which are closer to the ground truth. These experimental results demonstrate the effectiveness of extracting spatial-temporal features and of combining CNNs with MLPs in video frame interpolation.

5 Conclusion

In this paper, we have proposed a Spatial-Temporal Deformable Convolution Network (STDC-Net) for conference video frame interpolation to address dropped frames and reduced frame rates in video conference communication, improving video quality and creating a more immersive and engaging experience for all participants. In particular, STDC-Net extracts spatial-temporal features by splitting feature maps along horizontal, vertical, and temporal pathways and processing them with different MLPs at multiple scales. Based on the spatial-temporal features, the model generates the intermediate frame by predicting weights, offsets, and masks. By setting different split lengths at different scales, the proposed STDC-Net can extract both local and global spatial features. This enables the model to handle challenging scenarios such as moving subjects, facial expressions, and background changes, which are common in real-world video conference communication. The experiments have shown that the proposed model outperforms state-of-the-art methods; compared to the baseline, PSNR improves by 0.13dB and 0.17dB on the Voxceleb2 and HDTF datasets, respectively.