Visual Speech Analysis for Spoken Chinese Training of Oral Deaf Children

  • Xiaodong Jiang
  • Qianghua Qiang
  • Zhisong Zhou
  • Yunlai Wang
Conference paper
Part of the Eurographics book series (EUROGRAPH)


This paper presents STODE, a novel vision-based speech analysis system used in the spoken Chinese training of oral deaf children. Its design goal is to help oral deaf children overcome two major difficulties in speech learning: the confusion of intonations of spoken Chinese characters, and timing errors within different words and characters. The system integrates real-time lip tracking and feature extraction, multi-state lip modeling, and a Time-Delay Neural Network (TDNN) for visual speech analysis. A desk-mounted camera tracks the user in real time; at each frame, a region of interest is identified and key information is extracted. The preprocessed acoustic and visual information is then fed into a modular TDNN and combined for visual speech analysis. Confusion of intonations of spoken Chinese characters can be readily identified, and timing errors within words and characters can also be detected using a Dynamic Time Warping (DTW) algorithm. For visual feedback we have created an artificial talking head, cloned directly from the user's own images, which generates outputs showing both correct and incorrect ways of pronunciation. This system has been successfully used for spoken Chinese training of oral deaf children in cooperation with the Nanjing Oral School, under grants from the National Natural Science Foundation of China.
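The timing-error detection described above relies on Dynamic Time Warping to align a learner's utterance against a reference. As a minimal sketch of the idea (an illustrative implementation over 1-D feature sequences, not the authors' actual system code):

```python
# Minimal Dynamic Time Warping (DTW) sketch: align two 1-D sequences
# (e.g. per-frame features of a learner's and a reference utterance)
# and return the minimal cumulative alignment cost. Large cost, or a
# warping path that deviates strongly from the diagonal, would signal
# a timing error. Illustrative only; the paper's system operates on
# richer acoustic/visual features.

def dtw_distance(a, b):
    """Return the minimal cumulative alignment cost between sequences a and b."""
    n, m = len(a), len(b)
    INF = float("inf")
    # cost[i][j]: best cost of aligning a[:i] with b[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])       # local frame distance
            cost[i][j] = d + min(cost[i - 1][j],      # stretch a
                                 cost[i][j - 1],      # stretch b
                                 cost[i - 1][j - 1])  # step both
    return cost[n][m]
```

Because DTW permits non-linear stretching of the time axis, two utterances of the same word spoken at different speeds align with low cost, while genuine timing errors accumulate cost along the warping path.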






Copyright information

© Springer-Verlag/Wien 2000

Authors and Affiliations

  • Xiaodong Jiang (1)
  • Qianghua Qiang (1)
  • Zhisong Zhou (1)
  • Yunlai Wang (2)
  1. Department of Computer Science, Nanjing University, Nanjing, P.R. China
  2. Nanjing Oral School, Nanjing, P.R. China
