Learning Visual Models for Lipreading

Chapter in: Motion-Based Recognition

Part of the book series: Computational Imaging and Vision (CIVI, volume 9)

Abstract

This chapter describes learning techniques that are the basis of a “visual speech recognition” or “lipreading” system. Model-based vision systems currently have the best performance for many visual recognition tasks. For geometrically simple domains, models can sometimes be constructed by hand using CAD-like tools. Such models are difficult and expensive to construct, however, and they are inadequate for more complex domains. To do model-based lipreading, we would like a parameterized model of the complex “space of lip configurations”. Rather than building such a model by hand, our approach is to have the system itself build it using machine learning. The system is given a collection of training images, which it uses to automatically construct the models that are later used in recognition.
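To make the approach concrete, here is a minimal sketch of learning such a model from data: registered lip-region crops are partitioned with k-means, and a local PCA basis is fit in each partition, giving a mixture of local linear patches that approximates a nonlinear configuration manifold. The function names, the raw gray-level features, and the cluster and patch counts below are illustrative assumptions, not the chapter's exact formulation.

```python
# Minimal sketch: learn a "space of lip configurations" from training images
# as a mixture of local linear (PCA) patches -- one simple way to approximate
# a nonlinear appearance manifold. All names and parameters are illustrative
# assumptions, not the chapter's exact method.
import numpy as np

def learn_lip_model(images, n_patches=8, patch_dim=4, n_iters=20, seed=0):
    """images: (N, H, W) array of registered gray-level lip-region crops.
    Returns a list of (center, basis) pairs, one local patch per cluster."""
    rng = np.random.default_rng(seed)
    X = images.reshape(len(images), -1).astype(float)   # flatten to (N, D)

    # k-means: partition the training set into local neighborhoods
    centers = X[rng.choice(len(X), size=n_patches, replace=False)]
    for _ in range(n_iters):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for k in range(n_patches):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)

    # per-cluster PCA: each patch linearly approximates the manifold locally
    patches = []
    for k in range(n_patches):
        Xk = X[labels == k]
        if len(Xk) == 0:
            continue
        _, _, Vt = np.linalg.svd(Xk - centers[k], full_matrices=False)
        patches.append((centers[k], Vt[:patch_dim]))    # (mean, local basis)
    return patches

def encode(image, patches):
    """Project a new image onto its nearest patch: returns the patch index
    and a few manifold coordinates -- the learned 'lip configuration'."""
    x = image.reshape(-1).astype(float)
    k = int(np.argmin([((x - c) ** 2).sum() for c, _ in patches]))
    center, basis = patches[k]
    return k, basis @ (x - center)
```

A new image is thus encoded as a patch index plus a few manifold coordinates; a compact, learned parameterization of this kind is what a downstream recognizer (for example, an HMM over coordinate sequences) would consume.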

Copyright information

© 1997 Springer Science+Business Media Dordrecht

About this chapter

Cite this chapter

Bregler, C., Omohundro, S.M. (1997). Learning Visual Models for Lipreading. In: Shah, M., Jain, R. (eds) Motion-Based Recognition. Computational Imaging and Vision, vol 9. Springer, Dordrecht. https://doi.org/10.1007/978-94-015-8935-2_13

  • DOI: https://doi.org/10.1007/978-94-015-8935-2_13

  • Publisher Name: Springer, Dordrecht

  • Print ISBN: 978-90-481-4870-7

  • Online ISBN: 978-94-015-8935-2
