Abstract
Multimodal signal processing techniques are poised to play a key role in the implementation of natural human-computer interfaces. In particular, the development of efficient interface front ends that emulate interpersonal communication would benefit from techniques capable of processing the visual and auditory modalities jointly. This work applies audiovisual analysis and synthesis techniques based on Principal Component Analysis and Non-negative Matrix Factorization to facial audiovisual sequences. Furthermore, the applicability of the extracted audiovisual bases is assessed through several experiments that evaluate the quality of audiovisual resynthesis according to both objective and subjective criteria.
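The chapter's own audiovisual features and experimental setup are not reproduced here, but the Non-negative Matrix Factorization at the core of the approach can be sketched. The snippet below is a minimal illustration, assuming the standard multiplicative updates of Lee and Seung for the Frobenius cost; the toy matrix `V` merely stands in for a stack of non-negative audio and visual feature vectors, one column per frame.

```python
import numpy as np

def nmf(V, r, n_iter=200, eps=1e-9, seed=0):
    """Factor a non-negative matrix V (features x frames) into W @ H,
    W holding r non-negative basis vectors and H their activations,
    via Lee & Seung's multiplicative updates for the Frobenius norm."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, r)) + eps
    H = rng.random((r, m)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep W and H non-negative by construction.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy stand-in for stacked audiovisual features: 30 features x 100 frames,
# built with an exact non-negative rank of 10.
rng = np.random.default_rng(1)
V = rng.random((30, 10)) @ rng.random((10, 100))
W, H = nmf(V, r=10)

# Resynthesis quality as relative reconstruction error.
err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
```

In an audiovisual setting, the columns of `W` play the role of joint audiovisual bases: resynthesis amounts to recombining them through the activations in `H`, which is what the objective and subjective experiments in the chapter evaluate.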
Copyright information
© 2009 Springer-Verlag London
Cite this chapter
Sevillano, X., Melenchón, J., Cobo, G., Socoró, J.C., Alías, F. (2009). Audiovisual Analysis and Synthesis for Multimodal Human-Computer Interfaces. In: Redondo, M., Bravo, C., Ortega, M. (eds) Engineering the User Interface. Springer, London. https://doi.org/10.1007/978-1-84800-136-7_13
Print ISBN: 978-1-84800-135-0
Online ISBN: 978-1-84800-136-7