Continuous audio-visual speech recognition

  • Juergen Luettin
  • Stéphane Dupont
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 1407)

Abstract

We address the problems of robust lip tracking, visual speech feature extraction, and sensor integration for audio-visual speech recognition. An appearance-based model of the articulators, which represents linguistically important features, is learned from example images and is used to locate, track, and recover visual speech information. We tackle the problem of jointly modelling the temporal behaviour of the acoustic and visual speech signals by applying multi-stream hidden Markov models. This approach allows different temporal topologies and levels of stream integration, and hence enables temporal dependencies to be modelled more accurately. The system has been evaluated on a continuously spoken digit recognition task with 37 subjects.
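
To make the multi-stream idea concrete: each modality (here audio and video) is modelled by its own HMM, and the per-stream log-likelihoods are recombined with reliability weights at chosen anchor points (state, phone, or word boundaries). The short Python sketch below illustrates only this weighted log-linear recombination rule; the stream names, scores, and weights are illustrative assumptions, not values or code from the paper.

    def recombine(log_likelihoods, weights):
        """Weighted log-linear recombination:
        score = sum over streams s of weights[s] * log P(x_s | model_s).

        log_likelihoods: stream name -> per-segment HMM log-likelihood
        weights: stream name -> reliability exponent (hypothetical values)
        """
        return sum(weights[s] * log_likelihoods[s] for s in log_likelihoods)

    # Example: the audio is assumed noisy, so the visual stream is weighted up.
    segment_scores = {"audio": -142.7, "video": -95.3}  # illustrative numbers
    stream_weights = {"audio": 0.4, "video": 0.6}       # assumed to sum to 1
    print(recombine(segment_scores, stream_weights))    # -114.26

In a full recogniser the recombined score would feed back into Viterbi decoding at the chosen integration level; constraining the streams to meet only at these anchors is what lets each stream keep its own temporal topology.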

Keywords

Speech Recognition · Automatic Speech Recognition · Speech Recognition System · Visual Speech · Clean Speech

Copyright information

© Springer-Verlag Berlin Heidelberg 1998

Authors and Affiliations

  • Juergen Luettin (1)
  • Stéphane Dupont (2, 1)
  1. IDIAP - Dalle Molle Institute for Perceptual Artificial Intelligence, Martigny, Switzerland
  2. Faculté Polytechnique de Mons - TCTS 31, Mons, Belgium
