Journal of Intelligent Information Systems, Volume 34, Issue 2, pp 135–175

CLOVIS: towards precision-oriented text-based video retrieval through the unification of automatically-extracted concepts and relations of the visual and audio/speech contents

  • M. Belkhatir


Traditional multimedia (video) retrieval systems use a keyword-based approach to make the search process fast, but this approach has several shortcomings and limitations related to the way users are able to formulate their information needs. Typical Web multimedia retrieval systems illustrate this paradigm: a search returns a collection of thousands of multimedia documents, many of which are irrelevant or not fully exploited by the typical user. Indeed, studies of user behavior show that an individual is mostly interested in the initial documents returned during a search session. A multimedia retrieval system must therefore model the multimedia content as precisely as possible so that the first retrieved images are fully relevant to the user’s information need. For this, the keyword-based approach proves clearly insufficient, and a high-level index and query language that addresses the combination of modalities within expressive frameworks for video indexing and retrieval is of huge importance and the only solution for achieving significant retrieval performance. This paper presents a multi-faceted conceptual framework integrating multiple characterizations of the visual and audio contents for automatic video retrieval. It relies on an expressive representation formalism handling high-level video descriptions and on a full-text query framework, in an attempt to take video indexing and retrieval beyond trivial low-level processes, keyword-annotation frameworks and state-of-the-art architectures that loosely couple visual and audio descriptions. Experiments on the multimedia topic search task of the TRECVID evaluation campaign validate our proposal.


Video indexing and retrieval; Visual/audio integration; Conceptual graphs; Large-scale experimental validation



Copyright information

© Springer Science+Business Media, LLC 2009

Authors and Affiliations

  1. Center for Multimedia Computing, Communications and Applications Research, Monash University, Sunway, Malaysia
