
Mapping from Speech to Images Using Continuous State Space Models

  • Conference paper
Machine Learning for Multimodal Interaction (MLMI 2004)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 3361)


Abstract

In this paper a system that transforms speech waveforms into animated faces is proposed. The system relies on continuous state space models to perform the mapping, which makes it possible to produce video with no sudden jumps and allows continuous control of the parameters in “face space”.

The performance of the system depends critically on the number of hidden variables: with too few, the model cannot represent the data, and with too many, it overfits.

Simulations are performed on recordings of 3–5 second video sequences with sentences from the TIMIT database. From a subjective point of view, the model is able to construct an image sequence from an unknown noisy speech sequence even though the number of training examples is limited.
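The mapping described above can be sketched with a linear-Gaussian state space model: a smooth hidden trajectory is inferred from the audio features with a Kalman filter, and the same hidden state is then projected into face-space parameters. This is an illustrative sketch only; all dimensions and parameter matrices below are hypothetical stand-ins (in the actual system such parameters would be learned from paired audio–video data, e.g. via EM), not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 13 audio features (e.g. MFCC-like), 4 hidden
# state variables, 6 face-space (appearance-model) parameters, 50 frames.
n_audio, n_hidden, n_face, T = 13, 4, 6, 50

# Model parameters (random here; in practice learned from training data).
A = 0.9 * np.eye(n_hidden)                         # state dynamics (smooth trajectories)
C_audio = rng.standard_normal((n_audio, n_hidden)) # hidden state -> audio features
C_face = rng.standard_normal((n_face, n_hidden))   # hidden state -> face parameters
Q = 0.1 * np.eye(n_hidden)                         # process noise covariance
R = 0.5 * np.eye(n_audio)                          # observation noise covariance

def filter_states(y):
    """Kalman-filter the audio observations to estimate the hidden states.

    The predict step enforces temporal smoothness (no sudden jumps);
    the update step pulls the state toward each audio observation.
    """
    x = np.zeros(n_hidden)
    P = np.eye(n_hidden)
    states = []
    for yt in y:
        # Predict
        x = A @ x
        P = A @ P @ A.T + Q
        # Update with the audio observation
        S = C_audio @ P @ C_audio.T + R
        K = P @ C_audio.T @ np.linalg.inv(S)
        x = x + K @ (yt - C_audio @ x)
        P = (np.eye(n_hidden) - K @ C_audio) @ P
        states.append(x)
    return np.array(states)

audio = rng.standard_normal((T, n_audio))  # stand-in for extracted speech features
hidden = filter_states(audio)              # smooth hidden trajectory, shape (T, n_hidden)
face = hidden @ C_face.T                   # face-space parameters, shape (T, n_face)
print(face.shape)  # (50, 6)
```

Because consecutive face-parameter vectors come from a smoothly evolving hidden state, rendering them through an appearance model yields video without abrupt frame-to-frame jumps.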




Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Lehn-Schiøler, T., Hansen, L.K., Larsen, J. (2005). Mapping from Speech to Images Using Continuous State Space Models. In: Bengio, S., Bourlard, H. (eds) Machine Learning for Multimodal Interaction. MLMI 2004. Lecture Notes in Computer Science, vol 3361. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30568-2_12


  • DOI: https://doi.org/10.1007/978-3-540-30568-2_12

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-24509-4

  • Online ISBN: 978-3-540-30568-2

