Visual Features Extracting & Selecting for Lipreading

  • Hong-xun Yao
  • Wen Gao
  • Wei Shan
  • Ming-hui Xu
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2688)


This paper presents a method for effectively selecting and extracting visual features for lipreading. The features combine low-level and high-level information, which complement each other; a 41-dimensional feature vector is used for recognition. Tested on the bimodal database AVCC, which consists of sentences covering all Chinese pronunciations, lipreading assistance raises automatic speech recognition accuracy from 84.1% to 87.8%. Under noisy conditions, accuracy improves by 19.5 percentage points (from 31.7% to 51.2%) for speaker-dependent recognition and by 27.7 points (from 27.6% to 55.3%) for speaker-independent recognition. The paper demonstrates that visual speech information can effectively compensate for the loss of acoustic information: in our system it improves the recognition rate by 10% to 30%, varying with the amount of noise in the speech signal, an improvement range larger than that of IBM's ASR system, and it performs particularly well in noisy environments.
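The fusion of low-level and high-level visual features into a single 41-dimensional vector could be sketched as follows. The specific feature choices here, a 6x6 grid of grey-level block means (36 low-level dimensions) plus five geometric lip measurements (width, height, area, perimeter, aspect ratio), are illustrative assumptions chosen to total 41 dimensions, not the authors' exact features:

```python
import numpy as np

def block_average(img, rows=6, cols=6):
    """Low-level appearance features: mean grey level of each cell in a
    rows x cols grid over the mouth image (36 values for 6 x 6)."""
    h, w = img.shape
    img = img[:h - h % rows, :w - w % cols]          # crop to a divisible size
    bh, bw = img.shape[0] // rows, img.shape[1] // cols
    return img.reshape(rows, bh, cols, bw).mean(axis=(1, 3)).ravel()

def geometric_features(contour):
    """High-level shape features from a closed lip contour (N x 2 points):
    width, height, area (shoelace formula), perimeter, aspect ratio."""
    xs, ys = contour[:, 0], contour[:, 1]
    width, height = xs.max() - xs.min(), ys.max() - ys.min()
    area = 0.5 * abs(np.dot(xs, np.roll(ys, 1)) - np.dot(ys, np.roll(xs, 1)))
    edges = np.diff(contour, axis=0, append=contour[:1])  # wrap to close the contour
    perimeter = np.linalg.norm(edges, axis=1).sum()
    return np.array([width, height, area, perimeter, width / max(height, 1e-6)])

def extract_visual_features(mouth_roi, lip_contour):
    """Concatenate low- and high-level features: 36 + 5 = 41 dimensions."""
    return np.concatenate([block_average(mouth_roi), geometric_features(lip_contour)])
```

A per-frame vector like this would then be fed to the recogniser; the paper's actual low-level and high-level feature definitions should be taken from the full text.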


Keywords: Recognition Rate · Speech Recognition · Visual Feature · Automatic Speech Recognition · Grey Level Image





Copyright information

© Springer-Verlag Berlin Heidelberg 2003

Authors and Affiliations

  • Hong-xun Yao (1)
  • Wen Gao (1, 2)
  • Wei Shan (1)
  • Ming-hui Xu (1)

  1. Department of Computer Science and Engineering, Harbin Institute of Technology, Harbin, China
  2. Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
