Multi-Modal Dialog Scene Detection Using Hidden Markov Models for Content-Based Multimedia Indexing

Multimedia Tools and Applications

Abstract

A class of audio-visual data (fiction entertainment such as movies and TV series) is segmented into scenes that contain dialogs, using a novel hidden Markov model (HMM)-based method. Each shot is classified using both the audio track (via classification of speech, silence and music) and the visual content (face and location information). The result of this shot-based classification is an audio-visual token, which the HMM state diagram uses to perform scene analysis. In simulations with circular and left-to-right HMM topologies, both are observed to perform very well with multi-modal inputs. Moreover, for the circular topology, comparisons between different training and observation sets show that audio and face information together give the most consistent results across observation sets.
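The pipeline outlined in the abstract (per-shot audio-visual tokens decoded by an HMM) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the article: the token encoding in `shot_token`, the use of a binary "location changed" flag, the two-state model (non-dialog and dialog) with all transitions allowed, and every probability value. In practice the parameters would be estimated from training data rather than set by hand, and standard Viterbi decoding is used here only as one plausible way to read scene labels off the model.

```python
import numpy as np

AUDIO = ("speech", "silence", "music")
N_TOKENS = len(AUDIO) * 2 * 2  # audio class x face flag x location flag

def shot_token(audio, face, location_changed):
    """Encode one shot as an integer token (hypothetical encoding)."""
    return AUDIO.index(audio) * 4 + int(face) * 2 + int(location_changed)

def emission_row(favored, strong, weak):
    """Discrete emission distribution giving weight `strong` to favored tokens."""
    w = np.array([strong if t in favored else weak for t in range(N_TOKENS)],
                 dtype=float)
    return w / w.sum()

def viterbi(tokens, log_pi, log_A, log_B):
    """Most likely state sequence of a discrete-observation HMM (log domain)."""
    T, n_states = len(tokens), log_A.shape[0]
    delta = np.empty((T, n_states))
    back = np.zeros((T, n_states), dtype=int)
    delta[0] = log_pi + log_B[:, tokens[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A  # scores[i, j]: state i -> j
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[:, tokens[t]]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Assumed model: state 0 = non-dialog, state 1 = dialog.
dialog_tokens = {shot_token("speech", True, False),
                 shot_token("speech", True, True)}
pi = np.array([0.5, 0.5])
A = np.array([[0.8, 0.2],
              [0.2, 0.8]])                      # illustrative values only
B = np.vstack([emission_row(dialog_tokens, 1, 3),    # non-dialog state
               emission_row(dialog_tokens, 10, 1)])  # dialog state

# A made-up sequence of seven shots.
shots = [("music", False, True), ("silence", False, False),
         ("speech", True, False), ("speech", True, False),
         ("speech", True, False),
         ("music", False, True), ("silence", False, False)]
tokens = [shot_token(*s) for s in shots]
labels = viterbi(tokens, np.log(pi), np.log(A), np.log(B))
print(labels)
```

Runs of consecutive shots labeled 1 would then be reported as dialog scenes. A left-to-right topology could be sketched in the same way by forbidding backward transitions in `A`, although zero probabilities need care when taking logarithms.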

Cite this article

Alatan, A.A., Akansu, A.N. & Wolf, W. Multi-Modal Dialog Scene Detection Using Hidden Markov Models for Content-Based Multimedia Indexing. Multimedia Tools and Applications 14, 137–151 (2001). https://doi.org/10.1023/A:1011395131992

  • DOI: https://doi.org/10.1023/A:1011395131992