Abstract
This chapter describes learning techniques that form the basis of a “visual speech recognition” or “lipreading” system. Model-based vision systems currently give the best performance on many visual recognition tasks. For geometrically simple domains, models can sometimes be constructed by hand using CAD-like tools, but such models are difficult and expensive to build, and they are inadequate for more complex domains. Model-based lipreading calls for a parameterized model of the complex “space of lip configurations”. Rather than building such a model by hand, our approach is to have the system construct it itself using machine learning: given a collection of training images, the system automatically builds the models that are later used for recognition.
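The chapter's learned models of lip-configuration space are nonlinear; as a rough illustrative sketch of the general idea, the linear version below learns a low-dimensional parameterized model from flattened training images via principal component analysis. All names, the image size, and the model dimensionality `k` are hypothetical choices for this sketch, not the chapter's actual method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the training set: each row is a
# flattened grayscale lip-region image (here 8x8 = 64 pixels).
train = rng.normal(size=(50, 64))

# Learn a linear "space of lip configurations": subtract the mean
# image and keep the top principal directions as basis images.
mean = train.mean(axis=0)
_, _, vt = np.linalg.svd(train - mean, full_matrices=False)
k = 5                # model dimensionality (an assumption)
basis = vt[:k]       # (k, 64) learned basis images

def encode(image):
    """Project an image onto the model's k parameters."""
    return basis @ (image - mean)

def decode(params):
    """Reconstruct the closest image representable by the model."""
    return mean + params @ basis

new_image = rng.normal(size=64)
params = encode(new_image)        # compact k-dimensional description
reconstruction = decode(params)   # nearest point in the learned space
```

A new image is thus summarized by a handful of model parameters, which is what makes the learned model usable for downstream recognition; replacing the linear subspace with a learned nonlinear surface follows the same encode/decode pattern.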
© 1997 Springer Science+Business Media Dordrecht
Cite this chapter
Bregler, C., Omohundro, S.M. (1997). Learning Visual Models for Lipreading. In: Shah, M., Jain, R. (eds) Motion-Based Recognition. Computational Imaging and Vision, vol 9. Springer, Dordrecht. https://doi.org/10.1007/978-94-015-8935-2_13
Publisher Name: Springer, Dordrecht
Print ISBN: 978-90-481-4870-7
Online ISBN: 978-94-015-8935-2