Speech-Video Synchronization Using Lips Movements and Speech Envelope Correlation

Conference paper: Image Analysis and Recognition (ICIAR 2009)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 5627)


Abstract

In this paper, we propose a novel correlation-based method for speech-video synchronization and for classifying the relationship between the two signals. The method uses the envelope of the speech signal and data extracted from the lip movements. First, a nonlinear time-varying model is adopted that represents the speech signal as a sum of amplitude- and frequency-modulated (AM-FM) signals, each modeling a single speech formant frequency. Using a Taylor series expansion, the model is formulated so as to characterize the relation between the speech amplitude and the instantaneous frequency of each AM-FM signal with respect to the lip movements. Second, the envelope of the speech signal is estimated and correlated with signals generated from the lip movements. From the resulting correlation, the relation between the two signals is classified and the delay between them is estimated. The proposed method is applied to real cases, and the results show that it is able to (i) classify whether the speech and video signals belong to the same source, and (ii) estimate delays between the audio and video signals as small as 0.1 seconds when the speech signals are noisy and 0.04 seconds when the additive noise is less significant.
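The envelope-and-correlate step described above lends itself to a compact illustration. The following is a minimal Python sketch, not the authors' implementation: it estimates the speech envelope with a Hilbert transform followed by low-pass smoothing (a stand-in for the AM-FM formant-model envelope derived in the paper), then reads the audio-video delay off the peak of the normalized cross-correlation with a lip-movement trace. The function names and the 10 Hz cutoff are illustrative assumptions, and the lip signal is assumed to have been resampled to the same rate fs as the audio envelope.

    import numpy as np
    from scipy.signal import hilbert, butter, filtfilt

    def speech_envelope(speech, fs, cutoff_hz=10.0):
        """Slowly varying amplitude envelope of a speech signal.

        The magnitude of the analytic signal gives the instantaneous
        amplitude; a low-pass filter keeps only the slow modulation that
        can plausibly track lip motion. (Stand-in for the paper's
        AM-FM-model-based envelope.)
        """
        amplitude = np.abs(hilbert(speech))
        b, a = butter(4, cutoff_hz / (fs / 2.0), btype="low")
        return filtfilt(b, a, amplitude)

    def classify_and_delay(envelope, lip_signal, fs):
        """Cross-correlate the envelope and lip trace (both sampled at fs).

        Returns (delay_seconds, peak_correlation); the peak value can be
        thresholded to decide whether the audio and video belong to the
        same source.
        """
        e = (envelope - envelope.mean()) / envelope.std()
        m = (lip_signal - lip_signal.mean()) / lip_signal.std()
        xcorr = np.correlate(e, m, mode="full") / len(e)
        lag = int(np.argmax(xcorr)) - (len(m) - 1)
        return lag / fs, float(xcorr.max())

For instance, a 25 fps lip-width trace upsampled to fs = 100 Hz and correlated against a 100 Hz envelope gives a lag grid of 0.01 s, fine enough for the 0.04 s delays reported above; under the numpy convention used here, a positive delay means the audio envelope lags the lip trace.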




Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

El-Sallam, A.A., Mian, A.S. (2009). Speech-Video Synchronization Using Lips Movements and Speech Envelope Correlation. In: Kamel, M., Campilho, A. (eds) Image Analysis and Recognition. ICIAR 2009. Lecture Notes in Computer Science, vol 5627. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-02611-9_40

  • DOI: https://doi.org/10.1007/978-3-642-02611-9_40

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-02610-2

  • Online ISBN: 978-3-642-02611-9

  • eBook Packages: Computer Science (R0)
