Multimedia Tools and Applications, Volume 70, Issue 1, pp 159–175

Unsupervised scene detection and commentator building using multi-modal chains

  • Gert-Jan Poulisse
  • Yorgos Patsis
  • Marie-Francine Moens


This paper presents a novel unsupervised method for identifying the semantic structure of long, semi-structured video streams. We identify chains, i.e., local clusters of repeated features from both the video stream and the audio transcripts. Each chain indicates that the temporal interval it demarcates belongs to a single semantic event. When all chains are layered over one another, dense regions of overlapping chains emerge, from which we identify the semantic structure of the video. We present two clustering strategies that accomplish this task and compare them against a baseline Scene Transition Graph approach. Finally, we develop a commentator that provides a semantic labeling of the resulting video segmentation.
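The core idea of the abstract, that overlapping chains of repeated features mark coherent stretches of video, can be illustrated with a toy sketch. It assumes each modality has already been reduced to a map from a feature (for example a transcript word or a quantized visual descriptor) to its occurrence times in seconds. The names `build_chains`, `chain_density`, `dense_segments` and `label_segments`, and the parameters `max_gap`, `threshold` and `top_k`, are hypothetical. A fixed density threshold stands in for the paper's two clustering strategies, and ranking features by overlap is only a placeholder for the commentator; neither is the authors' method.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# A chain is (feature, start_time, end_time): a run of repeated occurrences of
# one feature in which consecutive occurrences are at most max_gap apart.
Chain = Tuple[str, float, float]


def build_chains(occurrences: Dict[str, List[float]], max_gap: float) -> List[Chain]:
    """Split each feature's occurrence times into chains (hypothetical helper)."""
    chains: List[Chain] = []
    for feature, times in occurrences.items():
        times = sorted(times)
        if not times:
            continue
        start = prev = times[0]
        count = 1
        for t in times[1:]:
            if t - prev > max_gap:
                # Gap too large: close the current run if the feature repeated.
                if count >= 2:
                    chains.append((feature, start, prev))
                start, count = t, 0
            prev = t
            count += 1
        if count >= 2:
            chains.append((feature, start, prev))
    return chains


def chain_density(chains: List[Chain], duration: int) -> List[int]:
    """Count how many chains cover each one-second step in [0, duration)."""
    density = [0] * duration
    for _, start, end in chains:
        for t in range(int(start), int(end) + 1):
            if 0 <= t < duration:
                density[t] += 1
    return density


def dense_segments(density: List[int], threshold: int) -> List[Tuple[int, int]]:
    """Return maximal runs of steps whose chain density is >= threshold."""
    segments = []
    start = None
    for t, d in enumerate(density):
        if d >= threshold and start is None:
            start = t
        elif d < threshold and start is not None:
            segments.append((start, t - 1))
            start = None
    if start is not None:
        segments.append((start, len(density) - 1))
    return segments


def label_segments(chains: List[Chain], segments: List[Tuple[int, int]],
                   top_k: int = 2) -> List[List[str]]:
    """Label each segment with the features whose chains overlap it most."""
    labels = []
    for seg_start, seg_end in segments:
        overlap: Dict[str, float] = defaultdict(float)
        for feature, start, end in chains:
            o = min(end, seg_end) - max(start, seg_start)
            if o >= 0:
                overlap[feature] += o + 1
        # Rank by overlap length, breaking ties alphabetically.
        ranked = sorted(overlap, key=lambda f: (-overlap[f], f))
        labels.append(ranked[:top_k])
    return labels


# Toy transcript features (times in seconds); "crowd" never repeats closely.
occurrences = {
    "swimmer": [0, 2, 4, 6],
    "pool": [1, 3, 5],
    "podium": [12, 14, 16],
    "medal": [13, 15],
    "crowd": [7, 20],
}
chains = build_chains(occurrences, max_gap=3)
density = chain_density(chains, duration=20)
segments = dense_segments(density, threshold=2)
print(segments)                          # [(1, 5), (13, 15)]
print(label_segments(chains, segments))  # [['pool', 'swimmer'], ['medal', 'podium']]
```

Raising `threshold` keeps only the regions where several chains overlap, shrinking segments toward their cores, while `max_gap` controls how far apart two occurrences may be and still belong to the same chain.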


Keywords: Semantic event detection · Feature extraction · Multi-modal scene segmentation · Video summarization



The work reported here was supported by the IWT-SBO project AMASS++ (Advanced Multimedia Alignment and Structured Summarization, IWT 060051) and TOSCA-MP (Task-oriented search and content annotation for media production, FP7-ICT 287532).



Copyright information

© Springer Science+Business Media, LLC 2012

Authors and Affiliations

  • Gert-Jan Poulisse (1)
  • Yorgos Patsis (2, 3)
  • Marie-Francine Moens (1)

  1. Department of Computer Science, Katholieke Universiteit Leuven, Leuven, Belgium
  2. IBBT, Ghent-Ledeberg, Belgium
  3. ETRO-DSSP, Vrije Universiteit Brussel, Brussels, Belgium
