
Audiovisual Analysis and Synthesis for Multimodal Human-Computer Interfaces

Chapter in Engineering the User Interface

Abstract

Multimodal signal processing techniques are expected to play a salient role in the implementation of natural human-computer interfaces. In particular, the development of efficient interface front ends that emulate interpersonal communication would benefit from techniques capable of processing the visual and auditory modes jointly. This work introduces the application of audiovisual analysis and synthesis techniques based on Principal Component Analysis and Non-negative Matrix Factorization to facial audiovisual sequences. Furthermore, the applicability of the extracted audiovisual bases is analyzed through several experiments that evaluate the quality of audiovisual resynthesis using both objective and subjective criteria.
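
The abstract summarizes the approach only at a high level. As a rough illustration of the kind of joint audiovisual factorization described here, the Python sketch below stacks per-frame visual features and time-aligned audio spectral features into a single non-negative observation matrix and factorizes it with the standard multiplicative-update Non-negative Matrix Factorization rules of Lee and Seung; the feature dimensions, the number of bases r, and the randomly generated inputs are illustrative assumptions and do not reflect the authors' actual data or implementation.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical inputs: one column per video frame.
# V_video: non-negative visual features (e.g. mouth-region pixel intensities)
# V_audio: non-negative audio features (e.g. magnitude spectra), aligned to the frames
V_video = rng.random((1200, 300))   # visual_dim x n_frames
V_audio = rng.random((257, 300))    # audio_dim  x n_frames

V = np.vstack([V_video, V_audio])   # joint audiovisual observation matrix
r = 20                              # number of audiovisual basis vectors

# Multiplicative-update NMF: V ~= W @ H with W, H non-negative
W = rng.random((V.shape[0], r)) + 1e-3
H = rng.random((r, V.shape[1])) + 1e-3
eps = 1e-9
for _ in range(200):
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)

# Each column of W is a joint audiovisual basis: its upper block is a visual
# basis image and its lower block the associated audio spectral shape.
W_visual, W_audio = W[:1200], W[1200:]

# Resynthesis: both modalities are reconstructed from the shared activations H.
V_hat = W @ H
print("relative reconstruction error:",
      np.linalg.norm(V - V_hat) / np.linalg.norm(V))

A Principal Component Analysis of the same stacked matrix (mean-centred and decomposed via the singular value decomposition) would yield the orthogonal counterpart of these audiovisual bases.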




Author information

Correspondence to Xavier Sevillano.



Copyright information

© 2009 Springer-Verlag London

About this chapter

Cite this chapter

Sevillano, X., Melenchón, J., Cobo, G., Socoró, J.C., Alías, F. (2009). Audiovisual Analysis and Synthesis for Multimodal Human-Computer Interfaces. In: Redondo, M., Bravo, C., Ortega, M. (eds) Engineering the User Interface. Springer, London. https://doi.org/10.1007/978-1-84800-136-7_13


  • DOI: https://doi.org/10.1007/978-1-84800-136-7_13


  • Publisher Name: Springer, London

  • Print ISBN: 978-1-84800-135-0

  • Online ISBN: 978-1-84800-136-7

  • eBook Packages: Computer Science, Computer Science (R0)
