This paper presents a novel unsupervised method for identifying the semantic structure in long semi-structured video streams. We identify chains, i.e., local clusters of repeated features from both the video stream and audio transcripts. Each chain serves as an indicator that the temporal interval it demarcates is part of the same semantic event. By layering all the chains over each other, dense regions emerge from the overlapping chains, from which we can identify the semantic structure of the video. We present two clustering strategies that accomplish this task, and compare them against a baseline Scene Transition Graph approach. We then develop a commentator that provides a semantic labeling of the resultant video segmentation.
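The layering idea can be illustrated with a minimal sketch. It assumes each chain is already available as an inclusive interval of time steps (for example shot indices); the function names `chain_density` and `dense_regions`, the example intervals, and the simple threshold rule are all hypothetical stand-ins, since the abstract does not specify the two clustering strategies actually used.

```python
def chain_density(chains, length):
    """Count how many chains cover each time step of a stream of `length` steps."""
    delta = [0] * (length + 1)
    for start, end in chains:  # each chain is an inclusive (start, end) interval
        delta[start] += 1
        delta[end + 1] -= 1
    density, running = [], 0
    for step in range(length):
        running += delta[step]
        density.append(running)
    return density


def dense_regions(density, threshold):
    """Return maximal runs (start, end) where at least `threshold` chains overlap."""
    regions, start = [], None
    for i, value in enumerate(density):
        if value >= threshold and start is None:
            start = i  # a dense run begins
        elif value < threshold and start is not None:
            regions.append((start, i - 1))  # the run ended at the previous step
            start = None
    if start is not None:
        regions.append((start, len(density) - 1))
    return regions


# Hypothetical chains: three overlap early in the stream, three overlap later.
chains = [(0, 4), (1, 5), (2, 4), (8, 11), (9, 12), (9, 11)]
density = chain_density(chains, length=14)
print(dense_regions(density, threshold=2))
```

With these example intervals, two dense regions are found, covering steps 1 to 4 and 9 to 11, separated by a gap that no chain spans. A real system would likely replace the fixed threshold with a proper clustering step.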
MATLAB's Savitzky-Golay filter: a generalized moving average whose filter coefficients are determined by an unweighted linear least-squares fit of a polynomial of specified degree (degree 7 is used here).
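To make the footnote concrete, the following is a minimal pure-Python sketch of how such filter coefficients follow from a least-squares polynomial fit. It is not MATLAB's implementation: the names `solve`, `savgol_coefficients`, and `smooth` are hypothetical, the window length of 9 in the example is an assumption (the footnote gives only the degree), and the samples near the edges are simply left unsmoothed.

```python
from fractions import Fraction


def solve(matrix, rhs):
    """Solve a small square linear system exactly by Gauss-Jordan elimination."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [v / pv for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [aug[r][n] for r in range(n)]


def savgol_coefficients(window, degree):
    """Centre-point smoothing weights from an unweighted polynomial fit."""
    if window % 2 == 0 or degree >= window:
        raise ValueError("window must be odd and larger than degree")
    half = window // 2
    offsets = range(-half, half + 1)
    n = degree + 1
    # Normal-equation matrix A^T A, where A[i][j] = offsets[i] ** j.
    ata = [[Fraction(sum(z ** (r + c) for z in offsets)) for c in range(n)]
           for r in range(n)]
    # The fitted value at offset 0 is the constant coefficient of the fit,
    # so the weight on the sample at offset z is sum_j u[j] * z**j,
    # where u solves (A^T A) u = e0.
    u = solve(ata, [Fraction(1)] + [Fraction(0)] * degree)
    return [sum(u[j] * Fraction(z) ** j for j in range(n)) for z in offsets]


def smooth(signal, window, degree):
    """Smooth interior samples; edge samples are left unchanged (an assumption)."""
    weights = savgol_coefficients(window, degree)
    half = window // 2
    out = list(signal)
    for i in range(half, len(signal) - half):
        out[i] = float(sum(w * signal[i - half + k] for k, w in enumerate(weights)))
    return out


# Degree 7 as in the footnote; the window length of 9 is a hypothetical choice.
noisy = [0, 1, 4, 2, 5, 9, 7, 8, 12, 10, 14, 13]
print(smooth(noisy, window=9, degree=7))
```

Exact rational arithmetic via `fractions.Fraction` is used here only to keep the small linear solve free of rounding error; a numerical library would normally be the practical choice.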
The work reported is supported by IWT-SBO project AMASS++ (Advanced Multimedia Alignment and Structured Summarization, IWT 060051) and TOSCA-MP (Task-oriented search and content annotation for media production, FP7-ICT 287532).
Cite this article
Poulisse, GJ., Patsis, Y. & Moens, MF. Unsupervised scene detection and commentator building using multi-modal chains. Multimed Tools Appl 70, 159–175 (2014). https://doi.org/10.1007/s11042-012-1086-0
- Semantic event detection
- Feature extraction
- Multi-modal scene segmentation
- Video summarization