Multimedia Tools and Applications, Volume 66, Issue 3, pp 545–572

Video content categorization using the double decomposition



Video content has a complex structure because of the variety of components and events it involves. For example, surveillance videos often record multi-object interactions whose motion detail spans multiple scales, while Web videos are composed of multimodal cues, each of which typically carries information at several scales. In general, video content combines its inherent structures in two ways: multi-modality/multi-scale and multi-object/multi-scale. We therefore propose a new framework for video content modeling in which video content is decomposed into multiple interacting processes by a double decomposition tailored to each of these two combinations. To model the resulting processes, we propose double-decomposed hidden Markov models (DDHMMs), which contain multiple state chains, one per interacting process. To keep the state-switching frequency of each chain consistent with the scale of its process, a durational state variable is introduced into DDHMMs. The proposed method effectively models both the relations among the interacting processes and the dynamics of each process. We discuss suitable features for the proposed framework and evaluate DDHMMs in two applications, human motion recognition and Web video categorization. The experimental results demonstrate that the double decomposition improves categorization performance in both cases.
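The full DDHMM formulation is given in the body of the paper; the abstract only outlines it. As a rough, hedged illustration of the idea sketched above (multiple interacting state chains, each with an explicit duration variable that fixes its switching scale), the Python snippet below generates observations from two chains that switch at different time scales. All state counts, duration ranges, transition matrices, and emission parameters are invented for illustration and are not taken from the paper.

```python
import numpy as np

# Illustrative sketch only: two interacting state chains, each with an explicit
# duration counter so that its states switch at a chain-specific time scale.
rng = np.random.default_rng(0)

def sample_chain(n_steps, n_states, trans, dur_range):
    """Sample one state chain whose states persist for explicitly drawn durations."""
    states = np.empty(n_steps, dtype=int)
    s = rng.integers(n_states)            # initial state
    d = rng.integers(*dur_range)          # remaining duration of the current state
    for t in range(n_steps):
        states[t] = s
        d -= 1
        if d <= 0:                        # duration exhausted: switch state, redraw duration
            s = rng.choice(n_states, p=trans[s])
            d = rng.integers(*dur_range)
    return states

T = 200
# Fast-switching chain (e.g., fine-scale motion detail): short durations.
fast = sample_chain(T, 2, np.array([[0.1, 0.9], [0.9, 0.1]]), dur_range=(2, 6))
# Slow-switching chain (e.g., coarse-scale activity): long durations.
slow = sample_chain(T, 2, np.array([[0.2, 0.8], [0.8, 0.2]]), dur_range=(20, 40))

# Observations depend jointly on both chains: each (fast, slow) state pair
# selects a Gaussian emission mean, which couples the two processes.
means = np.array([[0.0, 1.0],
                  [2.0, 3.0]])
obs = rng.normal(means[fast, slow], 0.3)
print(obs[:5])
```

In the paper's DDHMMs, inference and parameter learning over such coupled, duration-augmented chains are carried out with dynamic Bayesian network machinery; the snippet above only illustrates how explicit duration variables give each chain its own time scale.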


Keywords: Video content categorization · Double decomposition · Dynamic Bayesian network · Multiple scales · Stochastic process



The research presented in this paper is supported in part by the National Natural Science Foundation of China (60905018, 60903121, 61173109, 61175039), the Key Projects in the National Science & Technology Pillar Program (2011BAK08B02), the Research Fund for the Doctoral Program of Higher Education (20090201120032), and the Fundamental Research Funds for the Central Universities (xjj2009041, xjj20100051). The authors would like to thank the video team at United Technologies Research Center (UTRC) for their pertinent and constructive discussion, and Dr. K. P. Murphy for his Matlab BNT toolbox. The authors also thank the anonymous reviewers for their constructive advice.



Copyright information

© Springer Science+Business Media, LLC 2012

Authors and Affiliations

  1. Ministry of Education Key Lab for Intelligent Networks and Network Security, Xi’an Jiaotong University, Xi’an, China
  2. Department of Automation, Tsinghua University, Beijing, China
  3. School of Electronic and Information Engineering, Xi’an Jiaotong University, Xi’an, China
