Audio-Visual Speech Processing for Human Computer Interaction

  • Chapter
Advances in Robotics and Virtual Reality

Part of the book series: Intelligent Systems Reference Library ((ISRL,volume 26))

Abstract

This chapter presents an audio-visual speech recognition (AVSR) system for human-computer interaction (HCI) built around three modules: (i) a radial basis function neural network (RBF-NN) voice activity detector (VAD), (ii) watershed-based lips detection with H∞ lips tracking, and (iii) a multi-stream audio-visual back-end. The role of AVSR in the HCI pipeline and the background of each module are discussed first, followed by the design of the overall proposed system. Unlike conventional lips detection, which requires prior skin/non-skin detection and face localization, the proposed watershed lips detection aided by H∞ tracking locates the lips directly, rendering those preliminary steps unnecessary and saving processing time. In the audio modality, the proposed RBF-NN VAD offers better noise compensation and more precise speech localization than the conventional zero-crossing rate and short-term signal energy methods, yielding higher recognition performance. Finally, the developed AVSR system, which integrates the audio and visual information as a temporally synchronized audio-visual data stream, achieves a significant improvement over unimodal speech recognition as well as over decision-level and feature-level integration approaches.
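The conventional VAD baseline that the abstract compares against, i.e. framing the signal and thresholding short-term energy and zero-crossing rate (ZCR), can be sketched as follows. This is an illustrative sketch of the classical technique, not the chapter's RBF-NN method; the frame length and thresholds are arbitrary assumptions chosen for the synthetic example.

```python
import numpy as np

def energy_zcr_vad(signal, frame_len=256, energy_thresh=0.01, zcr_thresh=0.25):
    """Classify each frame as speech (True) or silence (False)
    using short-term energy and zero-crossing rate."""
    n_frames = len(signal) // frame_len
    decisions = []
    for i in range(n_frames):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        energy = np.mean(frame ** 2)                        # short-term energy
        zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2  # zero-crossing rate
        # Voiced speech: high energy, low ZCR; unvoiced speech: lower
        # energy but high ZCR -- accept either pattern as activity.
        decisions.append(bool(energy > energy_thresh or
                              (energy > 0.1 * energy_thresh and zcr > zcr_thresh)))
    return np.array(decisions)

# Synthetic example: one second of silence followed by a 200 Hz tone at 8 kHz.
fs = 8000
t = np.arange(fs) / fs
sig = np.concatenate([np.zeros(fs), 0.5 * np.sin(2 * np.pi * 200 * t)])
decisions = energy_zcr_vad(sig)
```

Fixed thresholds like these are exactly what breaks down under additive noise, which is the motivation the abstract gives for replacing this scheme with a trained RBF-NN detector.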




© 2012 IFIP

About this chapter

Cite this chapter

Chin, S.W., Seng, K.P., Ang, LM. (2012). Audio-Visual Speech Processing for Human Computer Interaction. In: Gulrez, T., Hassanien, A.E. (eds) Advances in Robotics and Virtual Reality. Intelligent Systems Reference Library, vol 26. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23363-0_6

  • DOI: https://doi.org/10.1007/978-3-642-23363-0_6

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-23362-3

  • Online ISBN: 978-3-642-23363-0

  • eBook Packages: Engineering, Engineering (R0)
