Visual Recognition of Spoken Phrases

  • Matti Pietikäinen
  • Abdenour Hadid
  • Guoying Zhao
  • Timo Ahonen
Part of the Computational Imaging and Vision book series (CIVI, volume 40)


Visual speech information plays an important role in speech recognition under noisy conditions and for listeners with hearing impairment. In this chapter, local spatiotemporal descriptors are used to represent and recognize isolated spoken phrases based solely on visual input. The positions of the eyes are used to localize the mouth region in each face image, and spatiotemporal local binary patterns extracted from these regions then describe the phrase sequences. Experiments show promising results. Advantages of the approach include local processing and robustness to monotonic gray-scale changes. Moreover, no error-prone segmentation of the moving lips is needed.
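The descriptor named in the abstract can be illustrated with a minimal sketch. It assumes the mouth region has already been cropped into a small gray-scale volume indexed as volume[t][y][x], and it uses a plain 3x3 square neighbourhood on each of the three orthogonal planes (XY, XT, YT). The function names lbp_code and lbp_top are hypothetical, and the chapter's actual neighbourhood sizes, radii, block division and classifier are not reproduced here. The final check illustrates the robustness to monotonic gray-scale changes mentioned above: a strictly increasing intensity mapping leaves the features unchanged.

```python
import random
from itertools import product

# Offsets of the 8 neighbours on a 3x3 square ring, visited in a fixed order.
# Assumption: a square ring is used instead of an interpolated circle.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(get, a, b):
    """Return the 8-bit LBP code at (a, b) of a 2D plane read through get(a, b)."""
    center = get(a, b)
    code = 0
    for bit, (da, db) in enumerate(OFFSETS):
        # A neighbour at least as bright as the centre sets its bit.
        if get(a + da, b + db) >= center:
            code |= 1 << bit
    return code


def lbp_top(volume):
    """Concatenate normalised LBP histograms from the XY, XT and YT planes.

    volume[t][y][x] holds gray values of a cropped mouth-region sequence.
    """
    T, H, W = len(volume), len(volume[0]), len(volume[0][0])
    n = (T - 2) * (H - 2) * (W - 2)
    if min(T, H, W) < 3:
        raise ValueError("volume needs at least 3 samples along every axis")
    hists = [[0] * 256 for _ in range(3)]
    for t, y, x in product(range(1, T - 1), range(1, H - 1), range(1, W - 1)):
        # XY plane: spatial appearance inside frame t.
        hists[0][lbp_code(lambda a, b: volume[t][a][b], y, x)] += 1
        # XT plane: change over time along row y.
        hists[1][lbp_code(lambda a, b: volume[a][y][b], t, x)] += 1
        # YT plane: change over time along column x.
        hists[2][lbp_code(lambda a, b: volume[a][b][x], t, y)] += 1
    return [count / n for hist in hists for count in hist]


# Synthetic 4-frame, 5x6 "clip"; real input would be mouth crops from video.
rng = random.Random(0)
clip = [[[rng.randrange(256) for _ in range(6)] for _ in range(5)] for _ in range(4)]
# A strictly increasing gray-scale mapping preserves every comparison.
brighter = [[[2 * v + 10 for v in row] for row in frame] for frame in clip]

features = lbp_top(clip)
assert features == lbp_top(brighter)
```

In a recognition setting, feature vectors of this kind would be computed per utterance and passed to a classifier; that stage is left out of the sketch.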


Keywords (added by machine, not by the authors): Recognition Rate, Speech Recognition, Face Image, Local Binary Pattern, Support Vector Machine Classifier



Copyright information

© Springer-Verlag London Limited 2011

Authors and Affiliations

  • Matti Pietikäinen (1)
  • Abdenour Hadid (1)
  • Guoying Zhao (1)
  • Timo Ahonen (2)

  1. Machine Vision Group, Department of Computer Science and Engineering, University of Oulu, Oulu, Finland
  2. Nokia Research Center, Palo Alto, USA
