Abstract
This chapter describes learning techniques that form the basis of a “visual speech recognition” or “lipreading” system. Model-based vision systems currently give the best performance on many visual recognition tasks. For geometrically simple domains, models can sometimes be constructed by hand using CAD-like tools, but such models are difficult and expensive to build, and they are inadequate for more complex domains. Model-based lipreading calls for a parameterized model of the complex “space of lip configurations”. Rather than building such a model by hand, our approach is to have the system construct it itself using machine learning: given a collection of training images, the system automatically builds the models that are later used for recognition.
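The chapter's learned models of lip-configuration space are nonlinear; as a rough illustrative sketch of the general idea, the linear version below learns a low-dimensional parameterized model from flattened training images via principal component analysis. All names, the image size, and the model dimensionality `k` are hypothetical choices for this sketch, not the chapter's actual method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the training set: each row is a
# flattened grayscale lip-region image (here 8x8 = 64 pixels).
train = rng.normal(size=(50, 64))

# Learn a linear "space of lip configurations": subtract the mean
# image and keep the top principal directions as basis images.
mean = train.mean(axis=0)
_, _, vt = np.linalg.svd(train - mean, full_matrices=False)
k = 5                # model dimensionality (an assumption)
basis = vt[:k]       # (k, 64) learned basis images

def encode(image):
    """Project an image onto the model's k parameters."""
    return basis @ (image - mean)

def decode(params):
    """Reconstruct the closest image representable by the model."""
    return mean + params @ basis

new_image = rng.normal(size=64)
params = encode(new_image)        # compact k-dimensional description
reconstruction = decode(params)   # nearest point in the learned space
```

A new image is thus summarized by a handful of model parameters, which is what makes the learned model usable for downstream recognition; replacing the linear subspace with a learned nonlinear surface follows the same encode/decode pattern.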
© 1997 Springer Science+Business Media Dordrecht
Cite this chapter
Bregler, C., Omohundro, S.M. (1997). Learning Visual Models for Lipreading. In: Shah, M., Jain, R. (eds) Motion-Based Recognition. Computational Imaging and Vision, vol 9. Springer, Dordrecht. https://doi.org/10.1007/978-94-015-8935-2_13
Publisher Name: Springer, Dordrecht
Print ISBN: 978-90-481-4870-7
Online ISBN: 978-94-015-8935-2