Multimedia Tools and Applications, Volume 41, Issue 3, pp 337–373

Concept detection and keyframe extraction using a visual thesaurus

  • Evaggelos Spyrou
  • Giorgos Tolias
  • Phivos Mylonas
  • Yannis Avrithis


This paper presents a video analysis approach based on concept detection and keyframe extraction employing a visual thesaurus representation. Color and texture descriptors are extracted from coarse regions of each frame, and a visual thesaurus is constructed by clustering these regions. The clusters, called region types, serve as a basis for representing local material information through a model vector constructed for each frame, which reflects the composition of the image in terms of region types. The model vector representation is used for keyframe selection, either within each video shot or across an entire sequence, and the selection process ensures that all region types are represented. A number of high-level concept detectors are then trained using global annotation, and Latent Semantic Analysis is applied. To enhance detection performance per shot, detection is applied to the selected keyframes of each shot, and a framework is proposed for working on very large data sets.
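The pipeline outlined in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and not the authors' implementation: plain k-means stands in for the clustering step (the abstract does not name the method), the model vector is taken to hold, for each region type, the distance to the closest region of the frame, keyframes are chosen as the frames that best match each region type, and Latent Semantic Analysis is approximated by a truncated SVD projection. The function names `build_thesaurus`, `model_vector`, `select_keyframes` and `lsa_project` are hypothetical.

```python
import numpy as np


def build_thesaurus(region_descriptors, k, iters=20, seed=0):
    """Cluster region descriptors into k region types.

    Assumption: plain k-means; the abstract only says "clustering".
    """
    data = np.asarray(region_descriptors, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), size=k, replace=False)].copy()
    for _ in range(iters):
        # Distance of every region to every centroid, shape (n_regions, k).
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = data[labels == j]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids


def model_vector(frame_regions, centroids):
    """One entry per region type: distance to the closest region of the frame."""
    regions = np.asarray(frame_regions, dtype=float)
    dists = np.linalg.norm(regions[:, None, :] - centroids[None, :, :], axis=2)
    return dists.min(axis=0)


def select_keyframes(model_vectors):
    """Pick, for each region type, the frame that matches it best.

    The union of these frames represents every region type at least once.
    """
    best_frame_per_type = np.asarray(model_vectors).argmin(axis=0)
    return sorted(set(best_frame_per_type.tolist()))


def lsa_project(model_vectors, rank):
    """Project model vectors onto the top `rank` right singular vectors."""
    matrix = np.asarray(model_vectors, dtype=float)
    _, _, vt = np.linalg.svd(matrix, full_matrices=False)
    return matrix @ vt[:rank].T
```

In this sketch each frame is given as an array with one row per region descriptor. The thesaurus is built from the stacked regions of all frames, each frame is then mapped to a model vector, and the stacked model vectors are passed to `select_keyframes` and `lsa_project`. Concept detectors would be trained on the projected vectors; that step is omitted here.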


Keywords: Concept detection, Keyframe extraction, Visual thesaurus, Region types



This work was partially supported by the European Commission under contracts FP7-215453 WeKnowIt, FP6-027026 K-Space and FP6-027685 MESH.



Copyright information

© Springer Science+Business Media, LLC 2008

Authors and Affiliations

  • Evaggelos Spyrou (1)
  • Giorgos Tolias (1)
  • Phivos Mylonas (1)
  • Yannis Avrithis (1)

  1. School of Electrical and Computer Engineering, National Technical University of Athens, Athens, Greece
