Abstract
We present a robust method to detect and locate a speaker through joint analysis of speech sound and video images. First, a short segment of speech is analyzed to estimate the rate of spoken syllables, and a difference image is formed using an optimal frame distance derived from that rate to detect mouth candidates. The candidates are then tracked to verify that one of them is indeed the mouth: the rate of mouth movements is estimated from the brightness change profile of the first candidate and, if the two rates agree, the three brightest parts in the resulting difference image are detected as the mouth and eyes. If not, the second candidate is tracked, and so on. The first-order moment of the power spectrum of the brightness change profile, together with the lateral shifts observed during tracking, is also used to check whether the candidates are facial parts.
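The core consistency test described above, comparing the syllable rate estimated from audio against the mouth-movement rate estimated from a brightness change profile, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the FFT-peak rate estimator, and the relative tolerance in `rates_agree` are all assumptions made for the example.

```python
import numpy as np

def dominant_rate(profile, fps):
    """Estimate the dominant repetition rate (Hz) of a 1-D brightness
    change profile from the peak of its power spectrum (DC excluded)."""
    profile = np.asarray(profile, dtype=float)
    profile = profile - profile.mean()           # remove DC offset
    spectrum = np.abs(np.fft.rfft(profile)) ** 2
    freqs = np.fft.rfftfreq(len(profile), d=1.0 / fps)
    peak = 1 + int(np.argmax(spectrum[1:]))      # skip the DC bin
    return freqs[peak], spectrum, freqs

def spectral_centroid(spectrum, freqs):
    """First-order moment of the power spectrum: a low centroid suggests
    slow, speech-like motion; a high one suggests noise or non-facial motion."""
    return float(np.sum(freqs * spectrum) / np.sum(spectrum))

def rates_agree(syllable_rate_hz, mouth_rate_hz, tol=0.2):
    """Accept a mouth candidate when the audio-derived syllable rate and the
    video-derived movement rate match within a relative tolerance."""
    return abs(syllable_rate_hz - mouth_rate_hz) <= tol * syllable_rate_hz

# Example: a synthetic 4 Hz mouth-movement profile sampled at 30 fps,
# checked against a 4 Hz syllable rate estimated from the audio.
fps = 30.0
t = np.arange(90) / fps
mouth_profile = np.sin(2 * np.pi * 4.0 * t)
mouth_rate, spectrum, freqs = dominant_rate(mouth_profile, fps)
print(rates_agree(4.0, mouth_rate))
```

In practice the profile would come from the tracked candidate region, and candidates failing either the rate agreement or the spectral-moment check would be discarded before moving on to the next candidate.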
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ikeda, O. (2007). Detection of a Speaker in Video by Combined Analysis of Speech Sound and Mouth Movement. In: Bebis, G., et al. Advances in Visual Computing. ISVC 2007. Lecture Notes in Computer Science, vol 4842. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-76856-2_59
DOI: https://doi.org/10.1007/978-3-540-76856-2_59
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-76855-5
Online ISBN: 978-3-540-76856-2