
Mapping from Speech to Images Using Continuous State Space Models

  • Conference paper
Machine Learning for Multimodal Interaction (MLMI 2004)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 3361)


Abstract

In this paper a system that transforms speech waveforms into animated faces is proposed. The system relies on continuous state space models to perform the mapping, which makes it possible to produce video with no sudden jumps and allows continuous control of the parameters in “face space”.

The performance of the system depends critically on the number of hidden variables: with too few, the model cannot represent the data, and with too many, it overfits.

Simulations are performed on recordings of 3–5 second video sequences with sentences from the TIMIT database. From a subjective point of view, the model is able to construct an image sequence from an unknown noisy speech sequence even though the number of training examples is limited.
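The mapping described above can be sketched with a linear-Gaussian state space model: a smooth hidden trajectory is inferred from the audio features with a Kalman filter, and the same hidden state is then projected into face-space parameters. This is an illustrative sketch only; all dimensions and parameter matrices below are hypothetical stand-ins (in the actual system such parameters would be learned from paired audio–video data, e.g. via EM), not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 13 audio features (e.g. MFCC-like), 4 hidden
# state variables, 6 face-space (appearance-model) parameters, 50 frames.
n_audio, n_hidden, n_face, T = 13, 4, 6, 50

# Model parameters (random here; in practice learned from training data).
A = 0.9 * np.eye(n_hidden)                         # state dynamics (smooth trajectories)
C_audio = rng.standard_normal((n_audio, n_hidden)) # hidden state -> audio features
C_face = rng.standard_normal((n_face, n_hidden))   # hidden state -> face parameters
Q = 0.1 * np.eye(n_hidden)                         # process noise covariance
R = 0.5 * np.eye(n_audio)                          # observation noise covariance

def filter_states(y):
    """Kalman-filter the audio observations to estimate the hidden states.

    The predict step enforces temporal smoothness (no sudden jumps);
    the update step pulls the state toward each audio observation.
    """
    x = np.zeros(n_hidden)
    P = np.eye(n_hidden)
    states = []
    for yt in y:
        # Predict
        x = A @ x
        P = A @ P @ A.T + Q
        # Update with the audio observation
        S = C_audio @ P @ C_audio.T + R
        K = P @ C_audio.T @ np.linalg.inv(S)
        x = x + K @ (yt - C_audio @ x)
        P = (np.eye(n_hidden) - K @ C_audio) @ P
        states.append(x)
    return np.array(states)

audio = rng.standard_normal((T, n_audio))  # stand-in for extracted speech features
hidden = filter_states(audio)              # smooth hidden trajectory, shape (T, n_hidden)
face = hidden @ C_face.T                   # face-space parameters, shape (T, n_face)
print(face.shape)  # (50, 6)
```

Because consecutive face-parameter vectors come from a smoothly evolving hidden state, rendering them through an appearance model yields video without abrupt frame-to-frame jumps.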




Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Lehn-Schiøler, T., Hansen, L.K., Larsen, J. (2005). Mapping from Speech to Images Using Continuous State Space Models. In: Bengio, S., Bourlard, H. (eds) Machine Learning for Multimodal Interaction. MLMI 2004. Lecture Notes in Computer Science, vol 3361. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30568-2_12


  • DOI: https://doi.org/10.1007/978-3-540-30568-2_12

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-24509-4

  • Online ISBN: 978-3-540-30568-2

