Detection of a Speaker in Video by Combined Analysis of Speech Sound and Mouth Movement

  • Conference paper
Advances in Visual Computing (ISVC 2007)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 4842)

Abstract

We present a robust method for detecting and locating a speaker through joint analysis of the speech sound and the video image. First, a short segment of speech sound is analyzed to estimate the rate of spoken syllables, and a difference image is formed, using the optimal frame distance derived from that rate, to detect mouth candidates. The candidates are then tracked to confirm that one of them is the mouth: the rate of mouth movement is estimated from the brightness change profile of the first candidate and, if the two rates agree, the three brightest parts of the resulting difference image are detected as the mouth and eyes. If the rates do not agree, the second candidate is tracked, and so on. The first-order moment of the power spectrum of the brightness change profile and the lateral shifts observed during tracking are also used to check whether the candidates are facial parts.
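
The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not say how the syllable rate is estimated or how the optimal frame distance is derived. The sketch therefore assumes that a rate is the strongest non-DC peak of a power spectrum, that the frame distance is half a syllable period, and that two rates "agree" when they differ by at most 0.5 Hz. All function names and parameters are hypothetical.

```python
import numpy as np


def energy_envelope(audio, hop):
    """Short-time energy: mean squared amplitude over windows of `hop` samples."""
    x = np.asarray(audio, dtype=float)
    n = (x.size // hop) * hop  # drop the incomplete trailing window
    return (x[:n].reshape(-1, hop) ** 2).mean(axis=1)


def power_spectrum(profile, sample_rate):
    """Power spectrum of a mean-removed 1-D profile, with its frequency axis in Hz."""
    x = np.asarray(profile, dtype=float)
    x = x - x.mean()
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(x.size, d=1.0 / sample_rate)
    return freqs, power


def dominant_rate(profile, sample_rate):
    """Assumed rate estimator: frequency of the strongest non-DC spectral peak."""
    freqs, power = power_spectrum(profile, sample_rate)
    return float(freqs[1:][np.argmax(power[1:])])


def spectral_first_moment(profile, sample_rate):
    """First-order moment (power-weighted mean frequency) of the power spectrum."""
    freqs, power = power_spectrum(profile, sample_rate)
    total = power.sum()
    return float((freqs * power).sum() / total) if total > 0 else 0.0


def frame_distance(rate_hz, fps):
    """Assumed rule: half a syllable period expressed in frames, at least 1."""
    return max(1, int(round(fps / (2.0 * rate_hz))))


def difference_image(frames, distance):
    """Mean absolute brightness difference between frames `distance` apart."""
    f = np.asarray(frames, dtype=float)
    return np.abs(f[distance:] - f[:-distance]).mean(axis=0)


def rates_agree(speech_rate, mouth_rate, tolerance_hz=0.5):
    """Assumed agreement test: the two rates differ by at most `tolerance_hz`."""
    return abs(speech_rate - mouth_rate) <= tolerance_hz
```

In this sketch the speech rate would come from `dominant_rate(energy_envelope(audio, hop), envelope_rate)`, the frame distance from `frame_distance(speech_rate, fps)`, and the mouth candidates from the brightest regions of `difference_image(frames, distance)`. Each candidate's brightness profile over time would then be passed to `dominant_rate` and compared against the speech rate with `rates_agree`. Candidate tracking and the lateral-shift check are omitted.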

Editor information

George Bebis, Richard Boyle, Bahram Parvin, Darko Koracin, Nikos Paragios, Syeda-Mahmood Tanveer, Tao Ju, Zicheng Liu, Sabine Coquillart, Carolina Cruz-Neira, Torsten Müller, Tom Malzbender

Copyright information

© 2007 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Ikeda, O. (2007). Detection of a Speaker in Video by Combined Analysis of Speech Sound and Mouth Movement. In: Bebis, G., et al. Advances in Visual Computing. ISVC 2007. Lecture Notes in Computer Science, vol 4842. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-76856-2_59

  • DOI: https://doi.org/10.1007/978-3-540-76856-2_59

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-76855-5

  • Online ISBN: 978-3-540-76856-2

  • eBook Packages: Computer Science, Computer Science (R0)
