Multi-Modal Dialog Scene Detection Using Hidden Markov Models for Content-Based Multimedia Indexing

Alatan, A. Aydin; Akansu, Ali N.; Wolf, Wayne

doi:10.1023/A:1011395131992


Published: June 2001

Volume 14, pages 137–151, (2001)

Multimedia Tools and Applications

A. Aydin Alatan¹,
Ali N. Akansu²,³ &
Wayne Wolf⁴


Abstract

A class of audio-visual data (fiction entertainment: movies, TV series) is segmented into scenes that contain dialogs, using a novel hidden Markov model (HMM)-based method. Each shot is classified using both the audio track (via classification of speech, silence and music) and the visual content (face and location information). The result of this shot-based classification is an audio-visual token that is used by the HMM state diagram to perform scene analysis. Simulations with circular and left-to-right HMM topologies show that both perform very well with multi-modal inputs. Moreover, for the circular topology, comparisons between different training and observation sets show that audio and face information together give the most consistent results among different observation sets.
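The pipeline described in the abstract (per-shot audio-visual tokens decoded by an HMM) can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration and is not taken from the article: the three state names, the token encoding in `make_token`, the factorized emission model, and all probability values. The transition table allows every state-to-state move, which loosely corresponds to a circular topology; a left-to-right variant would set the backward transitions to zero (and would then need a guard against `log(0)`).

```python
import math

# Hypothetical label sets; the article's actual states and tokens may differ.
AUDIO = ("speech", "silence", "music")
STATES = ("establishing", "dialog", "transition")


def make_token(audio, face, new_location):
    """Pack one shot's audio class, face flag and location flag into 0..11."""
    return AUDIO.index(audio) * 4 + int(face) * 2 + int(new_location)


# Illustrative per-state probabilities (made up, not trained values).
P_AUDIO = {
    "establishing": {"speech": 0.2, "silence": 0.3, "music": 0.5},
    "dialog": {"speech": 0.8, "silence": 0.15, "music": 0.05},
    "transition": {"speech": 0.1, "silence": 0.3, "music": 0.6},
}
P_FACE = {"establishing": 0.3, "dialog": 0.9, "transition": 0.2}
P_NEW_LOC = {"establishing": 0.8, "dialog": 0.1, "transition": 0.7}

# Fully connected transitions and start probabilities (also made up).
TRANS = {
    "establishing": {"establishing": 0.3, "dialog": 0.6, "transition": 0.1},
    "dialog": {"establishing": 0.05, "dialog": 0.8, "transition": 0.15},
    "transition": {"establishing": 0.5, "dialog": 0.2, "transition": 0.3},
}
START = {"establishing": 0.6, "dialog": 0.3, "transition": 0.1}


def emission(state, token):
    """P(token | state), assuming the three modalities are independent."""
    audio = AUDIO[token // 4]
    face = (token // 2) % 2
    new_loc = token % 2
    p = P_AUDIO[state][audio]
    p *= P_FACE[state] if face else 1 - P_FACE[state]
    p *= P_NEW_LOC[state] if new_loc else 1 - P_NEW_LOC[state]
    return p


def viterbi(tokens):
    """Return the most likely state sequence for a list of shot tokens."""
    score = {s: math.log(START[s]) + math.log(emission(s, tokens[0]))
             for s in STATES}
    back = []  # one back-pointer table per time step after the first
    for tok in tokens[1:]:
        prev, score, ptr = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda r: prev[r] + math.log(TRANS[r][s]))
            score[s] = (prev[best] + math.log(TRANS[best][s])
                        + math.log(emission(s, tok)))
            ptr[s] = best
        back.append(ptr)
    state = max(STATES, key=score.get)
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]


# Five hypothetical shots: (audio class, face detected, location changed).
shots = [
    ("music", False, True),
    ("speech", True, False),
    ("speech", True, False),
    ("silence", True, False),
    ("music", False, True),
]
path = viterbi([make_token(*shot) for shot in shots])
print(path)
```

In this sketch a dialog scene would be reported as a maximal run of consecutive shots labeled `dialog`. Treating the modalities as independent keeps the tables small; a trained model would instead estimate one emission probability per token and state from labeled sequences.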



Author information

Authors and Affiliations

Electrical-Electronics Engineering Department, Middle East Technical University, Balgat, Ankara, 06531, Turkey
A. Aydin Alatan
New Jersey Center for Multimedia Research, New Jersey Institute of Technology, University Heights, Newark, NJ, 07102, USA
Ali N. Akansu
Department of Electrical Engineering, Princeton University, Princeton, NJ, 08544-5263, USA
Wayne Wolf


About this article

Cite this article

Alatan, A.A., Akansu, A.N. & Wolf, W. Multi-Modal Dialog Scene Detection Using Hidden Markov Models for Content-Based Multimedia Indexing. Multimedia Tools and Applications 14, 137–151 (2001). https://doi.org/10.1023/A:1011395131992


Issue Date: June 2001
DOI: https://doi.org/10.1023/A:1011395131992
