Video Highlight Detection via Deep Ranking Modeling
The video highlight detection task is to localize key elements (moments of major or special interest to the user) in a video. Most existing highlight detection approaches extract features from a video segment as a whole, ignoring the differences among local features in both the temporal and spatial dimensions. Because video content is complex, such mixed features degrade the final highlight prediction. Temporally, not all frames are worth watching, since some contain only the background of the environment without humans or other moving objects. Spatially, the situation is similar: not all regions in each frame are highlights, especially when the background is heavily cluttered. To address this problem, we propose a novel attention model that automatically localizes the key elements in a video without any extra supervised annotations. Specifically, the proposed attention model produces attention weights for local regions along both the spatial and temporal dimensions of a video segment. Regions containing key elements are strengthened with large weights, yielding a more effective feature representation of the segment for predicting the highlight score. The proposed attention scheme can be easily integrated into a conventional end-to-end deep ranking model, which learns a deep neural network to compute the highlight score of each video segment. Extensive experiments on the YouTube dataset demonstrate that the proposed approach achieves significant improvements over state-of-the-art methods.
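The pipeline described above, spatio-temporal attention weights over a segment's local features followed by a scoring network trained with a pairwise ranking objective, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the attention scores, the linear scoring layer `w_score`, and the margin value are hypothetical stand-ins for the learned components.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a flat array of attention scores.
    e = np.exp(x - x.max())
    return e / e.sum()

def attended_feature(features, scores):
    """Pool a segment's local features with spatio-temporal attention.

    features: (T, H, W, D) local CNN features of a video segment
    scores:   (T, H, W) unnormalized attention scores (here given as
              input; in the paper they are produced by a learned model)
    returns:  (D,) attention-weighted segment feature
    """
    T, H, W, _ = features.shape
    # Softmax over all T*H*W locations so the weights sum to 1.
    w = softmax(scores.reshape(-1)).reshape(T, H, W, 1)
    return (w * features).sum(axis=(0, 1, 2))

def pairwise_ranking_loss(s_pos, s_neg, margin=1.0):
    """Hinge loss encouraging highlight score s_pos > s_neg + margin."""
    return max(0.0, margin - (s_pos - s_neg))

# Toy usage: one highlight and one non-highlight segment.
rng = np.random.default_rng(0)
f_pos = rng.standard_normal((4, 2, 2, 8))   # highlight segment features
f_neg = rng.standard_normal((4, 2, 2, 8))   # non-highlight segment features
a_pos = rng.standard_normal((4, 2, 2))      # hypothetical attention scores
a_neg = rng.standard_normal((4, 2, 2))
w_score = rng.standard_normal(8)            # hypothetical scoring layer
s_pos = attended_feature(f_pos, a_pos) @ w_score
s_neg = attended_feature(f_neg, a_neg) @ w_score
loss = pairwise_ranking_loss(s_pos, s_neg)
```

In training, the loss would be averaged over many highlight/non-highlight segment pairs and backpropagated through both the scoring layer and the attention model, so that informative regions receive large weights.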
Keywords: Video highlight detection · Attention model · Deep ranking
This work is supported in part by the National Natural Science Foundation of China under Grant 61432019, Grant 61572498, Grant 61532009, and Grant 61772244, the Key Research Program of Frontier Sciences, CAS, under Grant No. QYZDJ-SSW-JSC039, the Beijing Natural Science Foundation under Grant 4172062, and the Postgraduate Research & Practice Innovation Program of Jiangsu Province under Grant No. SJCX17_0599.