
Unsupervised scene detection and commentator building using multi-modal chains


This paper presents a novel unsupervised method for identifying the semantic structure of long, semi-structured video streams. We identify chains, i.e., local clusters of repeated features drawn from both the video stream and the audio transcripts. Each chain indicates that the temporal interval it demarcates belongs to a single semantic event. When all chains are layered over one another, dense regions of overlap emerge, and from these regions we identify the semantic structure of the video. We present two clustering strategies that accomplish this task and compare them against a baseline Scene Transition Graph approach. We then develop a commentator that provides a semantic labeling of the resulting video segmentation.
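The chain idea in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm: the function names, the `max_gap` rule for closing a chain, and the `min_overlap` cut rule are all assumptions made for illustration, and the two clustering strategies mentioned above are not reproduced here. The sketch only shows how repeated features can be turned into intervals, layered, and cut where the overlap thins out.

```python
from collections import defaultdict

def build_chains(observations, max_gap):
    """Group repeated occurrences of each feature into chains (start, end).

    observations: iterable of (time, feature) pairs, e.g. a shot index paired
    with a visual-cluster id or a transcript word (hypothetical input format).
    A chain is closed when the feature is absent for more than max_gap steps.
    """
    times_by_feature = defaultdict(list)
    for time, feature in observations:
        times_by_feature[feature].append(time)

    chains = []
    for times in times_by_feature.values():
        times.sort()
        start = prev = times[0]
        for t in times[1:]:
            if t - prev > max_gap:
                if prev > start:  # keep only features that actually repeat
                    chains.append((start, prev))
                start = t
            prev = t
        if prev > start:
            chains.append((start, prev))
    return chains

def overlap_density(chains, length):
    """For each gap between time t and t + 1, count the chains spanning it."""
    density = [0] * (length - 1)
    for start, end in chains:
        for t in range(start, end):
            density[t] += 1
    return density

def segment(density, min_overlap=1):
    """Cut wherever fewer than min_overlap chains bridge adjacent time steps.

    Returns inclusive (first, last) time ranges.
    """
    segments, start = [], 0
    for t, d in enumerate(density):
        if d < min_overlap:
            segments.append((start, t))
            start = t + 1
    segments.append((start, len(density)))
    return segments

# Toy stream: features "a"/"b" recur early, "c"/"d" recur late.
obs = [(0, "a"), (1, "b"), (2, "a"), (3, "b"),
       (4, "c"), (5, "d"), (6, "c"), (7, "d")]
chains = build_chains(obs, max_gap=3)
print(segment(overlap_density(chains, length=8)))  # [(0, 3), (4, 7)]
```

In the toy stream no chain bridges time steps 3 and 4, so the overlap count drops to zero there and the sketch places a boundary at that point.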




  1. Matlab: Savitzky-Golay filter, which is a moving average with filter coefficients determined by an unweighted linear least-squares regression and a polynomial model of specified degree (degree 7 used here).

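The Savitzky-Golay smoothing described in footnote 1 can be sketched in plain Python. This is an illustrative reimplementation, not the Matlab routine the authors used: the function names are hypothetical, the window half-width is an assumption (the footnote gives only the polynomial degree, 7), and the edge samples are simply left unsmoothed. Exact rational arithmetic is used so that the least-squares solution carries no rounding error.

```python
from fractions import Fraction

def savgol_center_weights(half_width, degree):
    """Weights giving the least-squares polynomial's value at the window centre."""
    offsets = range(-half_width, half_width + 1)
    if len(offsets) <= degree:
        raise ValueError("window must contain more points than the polynomial degree")
    n = degree + 1
    # Normal equations (A^T A) z = e_0, with Vandermonde entries A[i][j] = offset_i ** j.
    m = [[Fraction(sum(x ** (r + c) for x in offsets)) for c in range(n)]
         for r in range(n)]
    rhs = [Fraction(1)] + [Fraction(0)] * (n - 1)
    # Gauss-Jordan elimination in exact rational arithmetic.
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        rhs[col], rhs[pivot] = rhs[pivot], rhs[col]
        for r in range(n):
            if r != col and m[r][col] != 0:
                f = m[r][col] / m[col][col]
                m[r] = [a - f * b for a, b in zip(m[r], m[col])]
                rhs[r] -= f * rhs[col]
    z = [rhs[i] / m[i][i] for i in range(n)]
    # Weight for each sample: the fitted value at offset 0 as a function of the data.
    return [sum(z[j] * Fraction(x) ** j for j in range(n)) for x in offsets]

def smooth(signal, half_width=5, degree=7):
    """Apply the centre weights as a moving weighted average over interior samples."""
    weights = savgol_center_weights(half_width, degree)
    out = list(signal)  # edge samples are left unsmoothed in this sketch
    for i in range(half_width, len(signal) - half_width):
        window = signal[i - half_width:i + half_width + 1]
        out[i] = sum(w * v for w, v in zip(weights, window))
    return out
```

Because the weights come from a degree-7 least-squares fit, any signal that is itself a polynomial of degree 7 or lower passes through unchanged, while higher-frequency variation is attenuated. A wider window smooths more aggressively for the same degree.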




The work reported is supported by IWT-SBO project AMASS++ (Advanced Multimedia Alignment and Structured Summarization, IWT 060051) and TOSCA-MP (Task-oriented search and content annotation for media production, FP7-ICT 287532).

Author information



Corresponding author

Correspondence to Gert-Jan Poulisse.


About this article

Cite this article

Poulisse, GJ., Patsis, Y. & Moens, MF. Unsupervised scene detection and commentator building using multi-modal chains. Multimed Tools Appl 70, 159–175 (2014).



  • Semantic event detection
  • Feature extraction
  • Multi-modal scene segmentation
  • Video summarization