1 Introduction

In video retrieval, the accuracy of shot segmentation directly affects the performance of the retrieval system, so improving that accuracy is one of the difficult problems in video analysis [1]. In multimedia information, a video can be divided, by granularity, into four levels: frames, shots, scenes, and video streams. A shot is a set of consecutive frames captured by one camera that is continuous in time and space [2]. The shot segmentation task is therefore to divide a complete video into segments at shot boundaries. Shot transitions fall into two main types: abrupt and gradual. In an abrupt shot, there is no visible transition between the two discontinuous frames. In a gradual shot, a transition effect such as a fade-in, fade-out, or dissolve is inserted between the two discontinuous frames. Our goal is to locate the boundaries of both types of shot transition accurately.

2 Related Work

Common shot segmentation methods are mainly based on the difference between two adjacent frames. Jinlai [3] used multi-feature fusion for video shot segmentation. Biswas [4] combined local similarity and global features and used matrix cosine similarity to detect shot boundaries. Mohanta [5] used local feature-based frame transition parameters and frame estimation errors to achieve shot segmentation. To address copyright protection for multimedia video, Shang [6] proposed a motion-vector-based shot segmentation algorithm and embedded the watermark at a suitable location to better protect the video. Chongke [7] proposed a shot boundary detection framework based on dynamic mode decomposition, which reduced the rate of erroneous detections. Baraldi [8] used hierarchical clustering for shot and scene detection in broadcast video. These methods detect shot switching between different scenes with high accuracy, but they perform poorly on shot switching within the same scene. They also have low detection accuracy for gradual shots, so the accuracy of shot segmentation still needs to be improved.

Vision is one of the most important channels for multimedia video and is also the main means of human cognition [9]. Under the constraints of visual physiology and visual sensitivity, the ability of the human eye to distinguish the details of an object is basically the same for similar things. The visual system is insensitive to absolute brightness but sensitive to color contrast. In perceiving images, color plays an important role because viewers perceive and convey it directly [10]. According to the principle of visual continuity [11], vision tends to perceive continuous forms rather than discrete pieces, and people rely on consistent color to recognize a sequence of images as continuous. When the shot switches, the color distribution of the video frames changes substantially, so we use color as an important factor in shot segmentation. In addition, cognitive psychology research shows that humans have different degrees of interest in different regions of a video, which is known as the visual attention mechanism [12]. To simulate the degree of interest of the biological vision system in different regions of the video image, this paper studies how the video frame blocking strategy influences the detection of abrupt shots. For gradual shots, the brightness of the video frames changes significantly because post-processing (fading, etc.) is added between the two discontinuous frames. In view of how human vision perceives brightness information [13], this paper therefore studies a gradual shot detection method based on brightness.

The research in this paper covers two aspects: 1. A detection method for abrupt shots based on the visual cognition mechanism. For abrupt shot detection, this paper proposes a visual color block histogram method, which addresses the low accuracy caused by shot switching within the same scene. 2. A detection method for gradual shots based on the visual cognition mechanism. For gradual shot detection, this paper proposes a long-time difference method based on brightness information, which improves the detection accuracy for gradual shots.

3 Detection Method of Abrupt Shot Boundaries

3.1 Color Histogram Method

When detecting abrupt shots, we can employ the difference between two consecutive frames, since different shots are discontinuous in space and time and there is no post-processing (fading, etc.) between the two frames of an abrupt shot. Examples of abrupt shots in videos are shown in Figs. 1 and 2.

Fig. 1. Diagram of the abrupt shot A

Fig. 2. Diagram of the abrupt shot B

The common metric for the difference between adjacent video frames is the color histogram [14], which is insensitive to camera shake and to the motion of objects within the shot. The color histogram is a statistical table of the color distribution of the pixels in an image. The horizontal axis gives the color value intervals, and the vertical axis gives the percentage of the image's pixels that fall in each interval. The histogram describes the proportion of each color in the entire image and ignores the spatial location of each color. The HSV color space [15] is closer to the way humans perceive color. Therefore, we first convert frames from the original RGB color space to the HSV color space. The HSV model has three parameters: hue, saturation, and value. Hue is the basic property of the color. Saturation is the purity of the color: the higher the saturation, the purer the color. Value is the brightness of the color. The conversion from the RGB color space to the HSV color space is as follows:
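As an illustration only, the standard RGB-to-HSV conversion is available in Python's `colorsys` module. The sketch below assumes 8-bit RGB input and rescales each HSV channel to the range 0..255 so that it matches the index range k = 0, 1, ..., 255 of formula (1). The function name `rgb_to_hsv_8bit` and the rescaling are assumptions made for this example, not the paper's own definition.

```python
import colorsys


def rgb_to_hsv_8bit(r, g, b):
    """Convert 8-bit RGB values (0..255) to HSV, each channel scaled to 0..255.

    The 0..255 scaling of H, S and V is an assumption chosen so that the
    result can be fed directly into a 256-bin histogram.
    """
    # colorsys expects floats in [0, 1] and returns h, s, v in [0, 1].
    h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    return round(h * 255), round(s * 255), round(v * 255)


if __name__ == "__main__":
    print(rgb_to_hsv_8bit(255, 0, 0))  # pure red -> (0, 255, 255)
```

Hue is circular, so quantizing it to 0..255 is only one possible convention; other ranges work as long as the histogram bin count is adjusted to match.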

The formula for calculating the color histogram of an image is shown as follows:

$$ H\left( k \right) = \frac{{n_{k} }}{N},k = 0,1, \ldots ,255 $$
(1)

In the above formula, \( n_{k} \) is the number of pixels in the image whose value is k, N is the total number of pixels in the image, and H(k) is the resulting color histogram distribution (Fig. 3).
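Formula (1) can be sketched in a few lines of Python. The example assumes that one channel of the image is available as a flat list of integers in 0..255; the function name `normalized_histogram` is hypothetical.

```python
def normalized_histogram(values, bins=256):
    """Return H(k) = n_k / N for a flat list of integer channel values.

    `values` must contain integers in the range 0..bins-1.
    """
    if not values:
        raise ValueError("cannot build a histogram from an empty image")
    counts = [0] * bins
    for v in values:
        counts[v] += 1  # n_k: number of pixels whose value is k
    total = len(values)  # N: total number of pixels
    return [c / total for c in counts]


if __name__ == "__main__":
    h = normalized_histogram([0, 0, 1, 255])
    print(h[0], h[1], h[255])  # 0.5 0.25 0.25
```

Because every bin is divided by N, the histogram sums to 1, which makes histograms of frames or blocks of different sizes directly comparable.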

Fig. 3. HSV color model diagram

3.2 Visual Color Block Histogram Strategy

Because frames from different shots within a single scene are similar in color, people usually tell them apart by the region of interest rather than by the overall color distribution of the image. Using the color histogram of the entire image as the feature therefore leads to large errors. To address this, this paper proposes a segmentation method based on the visual cognition mechanism: the block color histogram. The video frame is first divided into blocks, and the color histogram of each block is computed. The block histograms are then weighted to obtain the color distribution of the entire image. To highlight the main content of the image and reduce the influence of the background, a larger weight is given to the main body region in the middle of the image, and the remaining background regions are given smaller weights. Since the four corners and the upper boundary of an image usually belong to the background, these areas receive a lower weight. Characters sometimes occupy a vertical strip of the frame, so the left and right sides deserve more attention. The weights are therefore assigned according to the scheme shown in Fig. 4. The number in each rectangle is the weight of that block, and the 1:4:1 marked on the outer box gives the proportions of the division along the length and the width.

Fig. 4. Video frame block diagram

After each video frame is divided into blocks, the color histogram difference of one block between two adjacent frames on a single channel is calculated as follows:

$$ d_{m} \left( {i,j} \right) = \frac{1}{2}\sum\nolimits_{k = 0}^{255} {\left| {H_{im} \left( k \right) - H_{jm} \left( k \right)} \right|} $$
(2)

In the above formula, \( H_{im} \left( k \right) \) is the color distribution of the i-th frame on the m-th block, and \( H_{jm} \left( k \right) \) is the color distribution of the j-th frame on the m-th block. Then, we calculate the color histogram difference of the two adjacent frames on the m-th block by averaging over the H, S, and V channels:

$$ D_{m} \left( {i,j} \right) = \frac{1}{3}\left( {d_{Hm} \left( {i,j} \right) + d_{Sm} \left( {i,j} \right) + d_{Vm} \left( {i,j} \right)} \right) $$
(3)

Finally, the color histogram difference between the two adjacent frames is the weighted average over the nine blocks, where \( w_{m} \) is the weight of the m-th block:

$$ T\left( {i,j} \right) = \frac{{\sum\nolimits_{m = 1}^{9} {w_{m} *D_{m} \left( {i,j} \right)} }}{{\sum\nolimits_{m = 1}^{9} {w_{m} } }} $$
(4)
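Formulas (2) to (4) can be combined into one short routine. The sketch below is a minimal illustration under stated assumptions: a frame is a list of rows of `(h, s, v)` tuples with values in 0..255, each axis is split 1:4:1, and the `WEIGHTS` table is a hypothetical stand-in for the values in Fig. 4, chosen only to respect the description (center largest, left and right sides medium, corners and top edge smallest). All function names are illustrative.

```python
def histogram(values, bins=256):
    """Normalized histogram H(k) = n_k / N of a flat list of integers."""
    counts = [0] * bins
    for v in values:
        counts[v] += 1
    return [c / len(values) for c in counts]


def hist_diff(a, b):
    """Formula (2): half the L1 distance between two normalized histograms."""
    return 0.5 * sum(abs(x - y) for x, y in zip(a, b))


def split_points(length):
    """Cut positions that divide an axis of `length` pixels as 1:4:1."""
    first = length // 6
    return [0, first, length - first, length]


def blocks(frame):
    """Yield the 9 blocks of a frame in row-major order as flat pixel lists.

    The frame must be at least 6 x 6 pixels so that no block is empty.
    """
    ys = split_points(len(frame))
    xs = split_points(len(frame[0]))
    for r in range(3):
        for c in range(3):
            yield [px
                   for row in frame[ys[r]:ys[r + 1]]
                   for px in row[xs[c]:xs[c + 1]]]


# Hypothetical weights (row-major); the real values are those of Fig. 4.
WEIGHTS = [1, 1, 1,
           2, 4, 2,
           1, 2, 1]


def frame_difference(frame_i, frame_j, weights=WEIGHTS):
    """Formula (4): weighted average of the per-block differences D_m."""
    total = 0.0
    for w, bi, bj in zip(weights, blocks(frame_i), blocks(frame_j)):
        # Formula (3): average the single-channel differences over H, S, V.
        d_m = sum(
            hist_diff(histogram([p[ch] for p in bi]),
                      histogram([p[ch] for p in bj]))
            for ch in range(3)
        ) / 3
        total += w * d_m
    return total / sum(weights)
```

With these assumed weights, a change confined to the center block contributes 4/15 of the maximum difference, while the same change in a corner block contributes only 1/15, which is the intended emphasis on the region of interest.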

When the color histogram difference T(i, j) reaches the threshold q, the transition is judged to be an abrupt shot. The threshold q is chosen adaptively for each video and is calculated as follows:

$$ q = \alpha * \frac{{T\left( {1,2} \right) + T\left( {2,3} \right) + \ldots + T\left( {i,i + 1} \right) + \ldots + T\left( {N - 1,N} \right)}}{N - 1} $$
(5)

In the above formula, T(i, i + 1) is the color histogram difference between video frame i and the next frame, N is the total number of frames, and α is a coefficient. Experiments show that values of α between 5 and 6 are suitable.
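The adaptive threshold of formula (5) and the resulting decision rule can be sketched as follows. The example assumes the N − 1 adjacent-frame differences have already been computed and stored in a list `diffs`, where `diffs[i]` holds the difference between frame i and frame i + 1 with 0-based indexing. The function names and the default α = 5 are choices made for this illustration.

```python
def adaptive_threshold(diffs, alpha=5.0):
    """Formula (5): alpha times the mean adjacent-frame difference."""
    if not diffs:
        raise ValueError("need at least two frames (one difference)")
    return alpha * sum(diffs) / len(diffs)


def detect_abrupt(diffs, alpha=5.0):
    """Return every index i where a cut lies between frame i and frame i + 1.

    A cut is reported when the difference reaches the threshold q.
    """
    q = adaptive_threshold(diffs, alpha)
    return [i for i, t in enumerate(diffs) if t >= q]


if __name__ == "__main__":
    # 20 synthetic differences: small everywhere except one spike.
    diffs = [0.01] * 9 + [0.9] + [0.01] * 10
    print(detect_abrupt(diffs))  # [9]
```

Because q is a multiple of the video's own mean difference, a video with a lot of motion automatically gets a higher threshold than a static one.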


4 Detection Method of Gradual Shot Boundaries

In a gradual shot, transition effects such as fade-in, fade-out, and dissolve are added between the two frames, so the difference between adjacent frames becomes smaller, which makes shot segmentation more challenging. Examples of gradual shots in videos are shown in Figs. 5 and 6.

Fig. 5. Diagram of the gradual shot A

Fig. 6. Diagram of the gradual shot B

Because the brightness changes only slightly between adjacent video frames during a gradual shot, the difference in brightness between adjacent frames is not a reliable cue. However, the human visual system perceives regular changes in brightness over a run of consecutive frames, so this change can be captured over a longer span. On this basis, this paper proposes a brightness-based long-time difference method to detect gradual shots. A sliding window 6 frames long is constructed and slid from the start of the video to the end to observe how the brightness difference of the frames in the window changes.

Similarly, we calculate the color histogram difference between adjacent frames:

$$ d\left( {i,j} \right) = \frac{1}{2}\sum\nolimits_{k = 0}^{255} {\left| {H_{i} \left( k \right) - H_{j} \left( k \right)} \right|} $$
(6)

Since we only use the information on the brightness channel when performing gradual shot detection, the difference between adjacent frames can be calculated as:

$$ T\left( {i,j} \right) = d_{V} \left( {i,j} \right) $$
(7)

Suppose the successive frames in a sliding window are frames i − 5, …, i − 2, i − 1, i. This paper first calculates the HSV color histogram of each frame and then computes the differences T(i, i − 1), T(i, i − 2), T(i, i − 3), T(i, i − 4), and T(i, i − 5). If the frames satisfy T(i, i − 1) < T(i, i − 2) < T(i, i − 3) < T(i, i − 4) < T(i, i − 5), and T(i, i − 1) is greater than a certain threshold, the transition is judged to be a gradual shot. The threshold is selected as in formula (5).
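The sliding-window test described above can be sketched as follows. The example assumes that the normalized V-channel histogram of every frame has already been computed and stored in the list `v_hists`, and that the threshold is passed in by the caller. The function names are hypothetical, and the rule is implemented literally as stated in the text: the differences must increase strictly with temporal distance, and T(i, i − 1) must exceed the threshold.

```python
def hist_diff(a, b):
    """Formulas (6) and (7): half the L1 distance between two histograms."""
    return 0.5 * sum(abs(x - y) for x, y in zip(a, b))


def detect_gradual(v_hists, threshold, window=6):
    """Return every frame index i whose 6-frame window looks like a gradual shot.

    v_hists[i] is the normalized brightness (V) histogram of frame i.
    """
    hits = []
    for i in range(window - 1, len(v_hists)):
        # t[k - 1] = T(i, i - k) for k = 1 .. window - 1
        t = [hist_diff(v_hists[i], v_hists[i - k]) for k in range(1, window)]
        increasing = all(x < y for x, y in zip(t, t[1:]))
        if increasing and t[0] > threshold:
            hits.append(i)
    return hits


if __name__ == "__main__":
    # Synthetic 2-bin histograms imitating a steady fade between two levels.
    fade = [[1 - a, a] for a in (0.0, 0.125, 0.25, 0.375, 0.5, 0.625, 0.75)]
    print(detect_gradual(fade, threshold=0.1))  # [5, 6]
```

Consecutive hits such as `[5, 6]` belong to the same transition, so in practice adjacent indices would be merged into a single gradual shot boundary.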


5 Experiments and Results

5.1 Dataset and Evaluation Criteria

The test videos come from TRECVID [16], an internationally authoritative dataset in the field of video detection. This paper uses some of the videos published on the Open Video website. Information about the test videos is given in Table 1:

Table 1. Test dataset information

In shot segmentation, the usual evaluation metrics are precision and recall. Precision is the percentage of detected shot boundaries that are correct, and recall is the percentage of actual shot boundaries that are correctly detected. They are defined as follows:

$$ P = \frac{TP}{TP + FP} $$
(8)
$$ R = \frac{TP}{TP + FN} $$
(9)

In the above formulas, TP is the number of correctly detected shot boundaries, FP is the number of incorrectly detected shot boundaries, and FN is the number of undetected shot boundaries. In addition, the F1 value is defined as the harmonic mean of precision and recall:

$$ F1 = \frac{2 * P * R}{P + R} $$
(10)
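Formulas (8) to (10) translate directly into code. In the sketch below, the function name and the convention of returning 0.0 when a denominator is zero are assumptions made for this example, and the counts in the usage line are invented for illustration rather than taken from the experiments.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision (8), recall (9) and F1 (10) from boundary counts.

    Returns 0.0 for any metric whose denominator would be zero.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical counts: 8 correct, 2 false and 2 missed boundaries.
    print(precision_recall_f1(tp=8, fp=2, fn=2))
```

The harmonic mean penalizes an imbalance between the two metrics: with P = 0.5 and R = 1.0, F1 is about 0.667 rather than the arithmetic mean 0.75.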

5.2 Results and Analysis of Abrupt Shot Detection

For abrupt shot detection, this paper compares the traditional color histogram method, a method from another paper, and the proposed histogram method based on visual color blocks. The experimental results are shown in Tables 2, 3 and 4:

Table 2. Anni005 video abrupt shots results
Table 3. BOR08 video abrupt shots results
Table 4. Anni009 video abrupt shots results

As the experimental results above show, dividing video frames into blocks and giving a larger weight to the main area increases the measured difference between discontinuous frames and improves abrupt shot detection. The advantage of the block color histogram proposed in this paper is clearest when comparing the results before and after blocking, that is, the traditional color histogram method against the proposed method.

5.3 Results and Analysis of Gradual Shot Detection

For gradual shot detection, this paper compares the traditional color histogram method, a method from another paper, and the proposed long-time difference method based on brightness information. The experimental results are shown in Tables 5, 6 and 7.

Table 5. Anni005 video gradual shots results
Table 6. BOR08 video gradual shots results
Table 7. Anni009 video gradual shots results

As the experimental results above show, gradual shot boundaries are difficult to detect because the difference between two adjacent frames is small, and the recognition rate is lower than for abrupt shots. Nevertheless, the long-time difference strategy based on brightness information proposed in this paper improves significantly on the traditional color histogram method. The use of brightness information also gives good tolerance to camera shake and to the motion of objects within a shot. However, scenes in which the brightness changes gradually (sunset, etc.) remain hard to handle in gradual shot detection, so there is still room for improvement.

6 Conclusions

Shot segmentation is a prerequisite for later video information retrieval, so its accuracy directly affects the performance of a video retrieval system. This paper first introduced the visual cognitive mechanism and human perception of color. Based on the visual cognitive mechanism, it then proposed a visual color block histogram method. This method strengthens the region of interest of the biological vision system, weakens the influence of the background, and improves the accuracy of abrupt shot detection. For gradual shots, a long-time difference method based on brightness information was proposed. It improves detection accuracy by capturing the way human vision perceives changes in brightness, and it relies on a sliding window, which is easy to implement. Compared with the other shot segmentation methods tested, precision and recall are improved, so the method can serve as an effective shot segmentation strategy. Because comparing the differences between adjacent frames is computationally expensive, a variable step size method will be considered in future work to speed up shot segmentation.