Capturing Text Semantics for Concept Detection in News Video

  • Gang Wang
  • Tat-Seng Chua
Part of the Signals and Communication Technology book series (SCT)


The overwhelming amount of multimedia content has triggered the need for automatic semantic concept detection. However, because the visual feature space varies widely, text from automatic speech recognition (ASR) has been used extensively and found effective in complementing visual features for concept detection. There are two common text analysis methods: text classification and text retrieval. Each has its own strengths and weaknesses. In addition, the fusion of text and visual analysis remains an open problem. In this paper, we present a novel multiresolution, multisource and multimodal (M3) transductive learning framework. We fuse text and visual features via a multiresolution model, because features from different modalities work well only at different temporal resolutions, which exhibit different types of semantics. We perform a multiresolution analysis at the shot, multimedia discourse, and story levels to capture the semantics in a news video. While visual features play a dominant role at the shot level, text plays an increasingly important role as we move from the multimedia discourse level towards the story level. Our multisource inference transductive model provides a way to combine text classification and text retrieval. We test our M3 transductive model for semantic concept detection on the TRECVID 2004 dataset. Preliminary results demonstrate that our approach is effective.
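The idea that text and visual evidence carry different weight at different temporal resolutions can be illustrated with a very simple weighted late fusion. This is only a sketch under stated assumptions, not the chapter's M3 transductive model: the function name `fuse_scores`, the table `RESOLUTION_TEXT_WEIGHT`, and the weight values are all hypothetical and chosen purely for illustration.

```python
# Hypothetical illustration only: these weights are invented for the sketch
# and are not taken from the chapter.
RESOLUTION_TEXT_WEIGHT = {
    "shot": 0.2,       # visual features dominate at the shot level
    "discourse": 0.5,  # text and visual evidence weighted equally
    "story": 0.8,      # text dominates at the story level
}


def fuse_scores(visual_score: float, text_score: float, resolution: str) -> float:
    """Linearly combine visual and text confidence scores for one concept.

    Both scores are assumed to lie in [0, 1]; `resolution` must be one of
    the keys of RESOLUTION_TEXT_WEIGHT.
    """
    w_text = RESOLUTION_TEXT_WEIGHT[resolution]
    return (1.0 - w_text) * visual_score + w_text * text_score


if __name__ == "__main__":
    # Same pair of scores, evaluated at each temporal resolution.
    for level in ("shot", "discourse", "story"):
        print(level, fuse_scores(visual_score=0.9, text_score=0.1, resolution=level))
```

With a strong visual score and a weak text score, the fused confidence in this sketch falls as the resolution moves from shot to story, mirroring the growing influence of text described above.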


Automatic Speech Recognition; Semantic Concept; Mean Average Precision; News Video; Text Retrieval
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer Science+Business Media, LLC 2009

Authors and Affiliations

  1. Department of Computer Science, School of Computing, National University of Singapore, Computing 1, Singapore 117590
