1 Introduction

With the rapid development of the Internet and video display devices, video multimedia services have attracted great interest. Compared with traditional 2D video, 3D video can provide a more realistic scene. Hence, 3D video is regarded as a hot research area and much progress has been made. As one of the 3D video formats [1, 2], texture video plus depth map (V + D) can provide a 3D scene through the Depth Image Based Rendering (DIBR) technique [3], which synthesizes a virtual view from a texture video and its associated depth map. Accordingly, the depth map must be transmitted in addition to the texture video.

As an extension of the H.264/AVC standard [4], Scalable Video Coding (SVC) provides several significant functionalities by dividing the bit stream into one base layer (BL) bit stream and several enhancement layer (EL) bit streams [5]. During transmission, the decoder can obtain an optimal bit stream by selectively discarding enhancement layer bit streams. To provide a scalable 3D scene on the display side, the texture video and its associated depth map should be transmitted to the display device simultaneously. The simplest way is to encode them separately in the SVC encoder and then synthesize the 3D video on the display side. However, this method does not exploit the correlation between the texture video and its associated depth map. More redundant information can be removed, and the compression efficiency further improved, when they are encoded jointly.

In 3D video coding, adding the depth map can effectively improve the coding efficiency of multi-view video transmission, so depth map coding has been studied extensively. In [6], Tao et al. proposed a method of compressing the texture video and depth map jointly by exploiting the correlation of their motion fields, which brought significant coding gain for the associated depth map. In [7], Lei et al. proposed a depth map coding scheme based on the structure and motion similarities between the texture video and its associated depth map; a new type of block named OD-Block and a DTCC based prediction method were presented for depth map coding, improving both the coding efficiency and the rendering quality. The similarity of motion vectors between the texture video and the depth map was also exploited for depth map coding in [8, 9].

Besides, some new edge-preserving algorithms were proposed because boundary regions play an important role in view rendering [10–13]. In [10], Zamarin et al. proposed a new edge-preserving intra mode for depth macroblocks (MBs) to improve the compression efficiency and the rendering quality. In [11], Kao and Tu proposed a compression method for the depth map that includes an edge detection module, a homogenizing module and a compression encoding module.

However, the methods mentioned above did not combine the importance of depth map edges with the correlation between the texture video and its associated depth map when encoding them in a scalable way. In this paper, a depth map coding method within SVC is studied: a new depth map edge detection algorithm and a dynamic quantization algorithm based on the detected edge regions are proposed.

The rest of this paper is organized as follows. First, in Sect. 2, the proposed single-view video plus depth map coding method is described in detail. Then, the simulation results are shown in Sect. 3. Finally, Sect. 4 concludes the paper with a summary.

2 The Proposed Method

The depth map corresponding to the texture video, as shown in Fig. 1, can be regarded as a special image whose pixel values represent the distance between the camera and the objects. The depth map consists of two parts, namely the boundary region (BR) and the homogeneous region (HR). Notably, the depth map is used for view rendering rather than being displayed directly on the display device.

Fig. 1.

The video plus depth format of 3D video: (a) the texture video; (b) the depth map.

Compared with the SVC scheme for 2D texture video, the depth map coding method in the SVC scheme of video plus depth clearly deserves study. So in this section, the SVC scheme of video plus depth, which supports both quality scalability and stereoscopic display, is first presented, and the inter-layer prediction is discussed. Then, a quantization algorithm based on edge detection is analyzed and studied to improve the quality of the depth map.

2.1 SVC Scheme of Video Plus Depth

Tao et al. [6] encoded the single-view video plus depth map into two layers: the BL, whose input is the texture video, and the EL, whose input is the depth map. Based on Tao's scheme, the improved encoding scheme proposed in this paper is illustrated in Fig. 2. Here, a three-layer coding structure is used.

Fig. 2.

The encoding scheme.

As shown in Fig. 2, in the BL a hybrid temporal and spatial prediction mode is used to encode the texture video to provide the basic video quality. EL1, the newly added layer in our structure, is a texture quality enhancement layer; its data refine the texture information. When the decoder receives the BL and/or EL1 bit stream, 2D video of different qualities can be displayed. EL2 takes the depth map as input. When the decoder receives the whole bit stream, including BL, EL1 and EL2, 3D scenes become available via the DIBR technique. Therefore, compared with Tao's scheme, our encoding scheme realizes not only quality scalability but also stereoscopic display.

In the SVC scheme, inter-layer prediction is used, a prediction tool added for spatial SVC [5]. Specifically, inter-layer prediction can be divided into three types: inter-layer motion prediction, inter-layer residual prediction and inter-layer intra prediction. When encoding an enhancement layer MB with inter-layer motion prediction, if the co-located MB in the reference layer is inter-coded, the current MB is also inter-coded, and its motion vectors and reference indexes are derived from the co-located MB in the reference layer. Besides, since quality SVC can be considered a special case of spatial SVC with identical video sizes for the BL and ELs, inter-layer prediction also applies to it.

When encoding EL1, all three inter-layer prediction tools are applied to improve the coding efficiency. In fact, the BL together with EL1 forms a 2D quality SVC whose coding scheme is the same as H.264/SVC, so it is not discussed further in this paper. When encoding EL2, only inter-layer motion prediction is applied to improve the coding efficiency of the depth map, for several reasons. Firstly, Fig. 1 shows that the structures of the texture video and the depth map are very similar. Secondly, because the texture video and its associated depth map are shot at the same place and time, they tend to have similar motion vectors; many researchers have already verified this motion similarity and improved coding efficiency by exploiting it, as in the papers mentioned above [6–9]. Thirdly, inter-layer residual prediction and inter-layer intra prediction are not applied because the residual information and pixel values of a depth MB differ considerably from those of the corresponding texture MB.

2.2 Depth Map Edge Detection

Because the BR of the depth map strongly affects the quality of the synthesized view, the boundary data should be preserved as exactly as possible during encoding. Therefore, for depth map encoding, we should first extract the boundary area from the depth map. The edge preserving methods mentioned above are all pixel based. However, in the H.264/SVC standard the basic encoding unit is the MB, and during mode decision the block sizes used by the MB partition modes are \( 16 \times 16 \), \( 8 \times 8 \), and \( 4 \times 4 \). Therefore, an edge detection algorithm based on blocks of different sizes is more suitable for a block based coding scheme than a pixel based algorithm. So in this paper, a block-based Sobel (BBS) edge detection algorithm is proposed to extract the boundary region. The basic idea of the proposed BBS algorithm is as follows:

During encoding, the current MB has multiple prediction modes with block sizes of \( N \times N \) \( (N = 16, 8, 4) \). Within an \( N \times N \) block, each pixel's horizontal and vertical gradients are calculated according to the conventional Sobel edge detection algorithm, and the joint gradient \( G(x,y) \) is then obtained by formula (1):

$$ G(x,y) = \sqrt {G_{x} (x,y)^{2} + G_{y} (x,y)^{2} } $$
(1)

where \( G_{x} (x,y) \) and \( G_{y} (x,y) \) represent the horizontal and vertical gradients, and \( (x,y) \) is the coordinate within the \( N \times N \) block.

For an \( N \times N \) block, the average gradient value over the interior pixels \( (x,y) \) with \( x,y \in [1,N - 2] \) is taken as the block-gradient of the current block, expressed by formula (2):

$$ G_{N} = \frac{{\sum\limits_{x = 1}^{N - 2} {\sum\limits_{y = 1}^{N - 2} {G(x,y)} } }}{{(N - 2)^{2} }} $$
(2)

After the block-gradient of an \( N \times N \) block is obtained, it is compared with a threshold (Thr), set empirically to 15. If \( G_{N} > Thr \), the current \( N \times N \) block belongs to the BR; otherwise it belongs to the HR. The Breakdancers and Ballet sequences are tested with the proposed BBS edge detection algorithm, and the results are shown in Figs. 3 and 4.
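The BR/HR classification of formulas (1) and (2) can be sketched in Python as follows (a minimal illustration, not the authors' implementation; the function names and pure-Python layout are ours):

```python
import math

# 3x3 kernels of the conventional Sobel operator referenced by Eq. (1)
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def block_gradient(block):
    """Block-gradient G_N of an N x N depth block per Eqs. (1)-(2):
    the joint Sobel gradient G(x,y) averaged over the interior
    pixels (x, y in [1, N-2])."""
    n = len(block)
    total = 0.0
    for x in range(1, n - 1):
        for y in range(1, n - 1):
            gx = gy = 0.0
            for i in range(3):
                for j in range(3):
                    p = block[x + i - 1][y + j - 1]
                    gx += SOBEL_X[i][j] * p
                    gy += SOBEL_Y[i][j] * p
            total += math.hypot(gx, gy)  # sqrt(Gx^2 + Gy^2), Eq. (1)
    return total / (n - 2) ** 2          # Eq. (2)

def classify_block(block, thr=15.0):
    """Label an N x N block (N = 16, 8 or 4) as boundary region (BR)
    or homogeneous region (HR) by comparing G_N with Thr = 15."""
    return 'BR' if block_gradient(block) > thr else 'HR'
```

For example, a perfectly flat depth block has a block-gradient of zero and is labeled HR, while a block containing a sharp depth step is labeled BR.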

Fig. 3.

Results of Ballet sequence: (a) the original depth map; (b) BR with 4 × 4 block size; (c) BR with 8 × 8 block size; (d) BR with 16 × 16 block size.

Fig. 4.

Results of Breakdancers sequence: (a) the original depth map; (b) BR with 4 × 4 block size; (c) BR with 8 × 8 block size; (d) BR with 16 × 16 block size.

As Figs. 3 and 4 show, the proposed BBS edge detection algorithm extracts the BR of the depth map fairly accurately, and the smaller the block size is, the more accurately the BR is extracted.

2.3 Dynamic Quantization

In the quantization process, once the QP of the texture video has been decided, it is important to choose a suitable QP for the depth map to obtain a good quality synthetic view. A good depth map quality can be obtained when the bit rate of the depth map accounts for 10%–20% of the bit rate of the texture video [14]. So experiments are carried out to choose a better QP for the depth map based on the distribution of quantization parameters between texture and depth. In our experiments, four QP groups are selected for the texture videos: (28, 24), (32, 28), (36, 32) and (40, 36). In each group, the first parameter is for the BL and the second is for the quality layer EL1. Three representative sequences are tested: Ballet, Breakdancers and Newspaper; the first two contain intense motion and the last contains slow motion. The QP of the depth map is obtained by adding a QP Difference to the QP of the texture video in the BL. We take the quality of the synthetic view as the evaluation criterion, since the purpose of obtaining a high quality depth map is to obtain a high quality synthetic view. The experimental results are given in Table 1.

Table 1. Experimental results for selecting an optimal QP for the depth map encoding.

In Table 1, the QP Difference (3, 5, 7 or 9) is the difference in QP between the BL and EL2. Bitrate is the sum of the bit rates of the texture video and the depth map when they are encoded together, and SynView PSNR is the PSNR of the synthetic view. From the experimental results in Table 1, the RD curves can be drawn, as shown in Fig. 5.

Fig. 5.

RD curves of different sequences with QP Difference = n: (a) Ballet; (b) Breakdancers; (c) Newspaper.

From Fig. 5, we can see that the synthetic view obtains the best performance for the Newspaper sequence with QP Difference = 7. For Ballet and Breakdancers, although QP Difference = 3 gains a slightly higher PSNR, the PSNR difference between QP Difference = 3 and 7 is very small. Besides, the bit rates with QP Difference = 7 are smaller than those with QP Difference = 3, so the PSNR difference can be ignored.

Furthermore, the relationship of the bit rates between the texture videos and the depth maps is illustrated in Table 2. From Table 2, we can see that these relationships are basically consistent with [14], with only a small deviation when QP Difference = 7. This is because the Ballet sequence has the most intense motion and thus needs more bitrate to transmit the depth map, while the Newspaper sequence, which has the least motion, needs less bitrate for the depth map. So QP Difference = 7 is chosen as the optimal value in our experiments.

Table 2. The relationship of bit rates between texture videos and depth maps

In addition, combined with the proposed BBS edge detection algorithm, which is applied in the mode decision process of EL2, dynamic quantization is applied to the depth map to improve its edge quality. The flowchart of the BBS edge detection algorithm and the dynamic quantization can be seen in Fig. 6.

Fig. 6.

The flowchart of the proposed BBS edge detection algorithm and dynamic quantization.

Because of the effect of the depth boundary on the quality of the synthetic view, the BR needs to be preserved as much as possible with a smaller QP, while the HR can be quantized with a QP larger than the initial QP set in the configuration file. So a dynamic quantization based on the depth BR is proposed, and the final QP can be expressed as:

$$ QP = QP_{init} + \Delta QP_{H/B} $$
(3)

where \( QP_{init} \) is the initial QP set in the configuration file and \( \Delta QP_{H/B} \) is a variable offset adjusted according to the block size and the region. When the current block belongs to the BR, \( \Delta QP_{H/B} \) is negative; otherwise it is positive. Moreover, the smaller the block size is, the more important the block is, so it should be quantized with a smaller QP. In our experiments, \( \Delta QP_{H/B} \) is set as:

$$ \Delta QP_{H/B} = \left\{ {\begin{array}{*{20}c} { - 2,CurrentBlock \in BR\;and\;16 \times 16} \\ { - 3,CurrentBlock \in BR\;and\;8 \times 8} \\ { - 5,CurrentBlock \in BR\;and\;4 \times 4} \\ {4,CurrentBlock \in HR} \\ \end{array} } \right. $$
(4)

Since smaller QPs are set for the boundary regions and a larger QP for the homogeneous regions, the average change in the size of the bit stream is small.
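Equations (3) and (4) amount to a simple per-block lookup, sketched below (an illustrative Python fragment; the function names are ours, not part of the JSVM code):

```python
def delta_qp(region, block_size):
    """ΔQP_{H/B} from Eq. (4): boundary blocks get a negative offset
    whose magnitude grows as the block size shrinks; homogeneous
    blocks get +4."""
    if region == 'HR':
        return 4
    # region == 'BR'
    return {16: -2, 8: -3, 4: -5}[block_size]

def final_qp(qp_init, region, block_size):
    """Final QP per Eq. (3): QP = QP_init + ΔQP_{H/B}."""
    return qp_init + delta_qp(region, block_size)
```

For instance, with \( QP_{init} = 35 \), a 4 × 4 boundary block is quantized with QP 30, while any homogeneous block is quantized with QP 39.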

3 Simulation Results

Experiments are carried out to evaluate the proposed method in the SVC reference software JSVM 9.19. Three video sequences, Ballet, Breakdancers and Newspaper, are tested. The main encoding parameters are presented in Table 3.

Table 3. Main encoding parameters for evaluating the proposed algorithm.

The same inter-layer prediction is applied in both our coding scheme and the standard JSVM model. In addition, the initial QP for the depth map is obtained by adding 7 to the QP of the texture video, as verified in the previous section. The bitrate and PSNR values of the proposed method and of the standard three-layer JSVM model without edge detection are shown in Table 4.

Table 4. Comparison of the proposed method and the original JSVM method.

Since the texture video input to the BL and EL1 is processed in the same way as in the standard JSVM model, the reconstructed quality of the BL and EL1 in our coding method is substantially unchanged compared with the standard JSVM model. The bitrates and PSNR in Table 4 therefore represent the overall bitrates of all layers and the PSNR of the depth map, respectively. The Bjøntegaard bitrate (BDBR, %) and the Bjøntegaard PSNR (BDPSNR, dB) [15] between the proposed method and the original JSVM method are calculated. From Table 4, the proposed method achieves BDBR savings of 9.72% to 19.36% and BDPSNR gains of 0.463 dB to 0.941 dB.
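For reference, the Bjøntegaard delta bitrate reported above can be computed from four RD points per curve along the lines sketched below (a pure-Python rendition of the standard four-point method [15], not the tool used for the paper's numbers; all names are ours):

```python
import math

def _polyfit3(xs, ys):
    """Fit a cubic through four points by solving the Vandermonde
    system with Gaussian elimination (exact interpolation, as in the
    usual four-point Bjøntegaard setup)."""
    a = [[x ** 3, x ** 2, x, 1.0] for x in xs]
    b = list(ys)
    n = 4
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= f * a[col][c]
            b[r] -= f * b[col]
    coef = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = b[r] - sum(a[r][c] * coef[c] for c in range(r + 1, n))
        coef[r] = s / a[r][r]
    return coef  # [c3, c2, c1, c0]

def _integral(coef, lo, hi):
    """Definite integral of the cubic over [lo, hi]."""
    c3, c2, c1, c0 = coef
    F = lambda x: c3 * x**4 / 4 + c2 * x**3 / 3 + c1 * x**2 / 2 + c0 * x
    return F(hi) - F(lo)

def bd_rate(rates_ref, psnr_ref, rates_test, psnr_test):
    """Bjøntegaard delta bitrate in percent (negative = saving):
    fit log10(rate) as a cubic in PSNR for each curve, integrate both
    over the overlapping PSNR range, and convert the mean log-rate
    difference back to a percentage."""
    lr_ref = [math.log10(r) for r in rates_ref]
    lr_test = [math.log10(r) for r in rates_test]
    lo = max(min(psnr_ref), min(psnr_test))
    hi = min(max(psnr_ref), max(psnr_test))
    c_ref = _polyfit3(psnr_ref, lr_ref)
    c_test = _polyfit3(psnr_test, lr_test)
    avg_diff = (_integral(c_test, lo, hi) - _integral(c_ref, lo, hi)) / (hi - lo)
    return (10 ** avg_diff - 1) * 100
```

As a sanity check, comparing a curve against itself yields 0%, and halving every bitrate at the same PSNR yields −50%.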

Finally, the RD performance of the three sequences is depicted in Fig. 7 to evaluate the effectiveness of the proposed method. As we can see, the proposed method performs better than the standard JSVM model.

Fig. 7.

RD performance comparisons: (a) Ballet; (b) Breakdancers; (c) Newspaper.

4 Conclusions

In our coding scheme, the texture video is encoded as the BL and EL1, while its associated depth map is encoded as EL2. Considering the motion similarity between the texture video and its associated depth map, inter-layer motion prediction is applied when encoding the depth map, while the other two inter-layer prediction tools are not. Furthermore, when selecting the optimal prediction mode, the proposed block-based Sobel edge detection algorithm judges whether the current block lies in the boundary region, and dynamic quantization is then applied to boundary and homogeneous blocks accordingly.

Simulation results show that the proposed method has an overall better performance than the standard JSVM model.