Abstract
This chapter presents an audio-visual speech recognition (AVSR) system for human-computer interaction (HCI) that comprises three main modules: (i) radial basis function neural network (RBF-NN) voice activity detection (VAD), (ii) watershed lips detection with H∞ lips tracking, and (iii) multi-stream audio-visual back-end processing. The role of AVSR in the HCI pipeline and the background of each module are discussed first, followed by the design details of the overall proposed system. Unlike conventional lips detection, which requires prior skin/non-skin detection and face localization, the proposed watershed lips detection aided by H∞ lips tracking locates the lips directly, rendering those preliminary steps unnecessary and potentially saving time. In addition, the proposed RBF-NN VAD offers better noise compensation and more precise speech localization than the conventional zero-crossing rate and short-term signal energy methods, yielding higher recognition performance through the audio modality. Finally, the developed AVSR system, which integrates the audio and visual information together with the temporal synchrony of the audio-visual data streams, achieves a significant improvement over unimodal speech recognition as well as over decision- and feature-level integration approaches.
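To ground the comparison made for module (i), the conventional baseline the chapter improves on can be sketched as a zero-crossing-rate plus short-term-energy endpoint detector in the style of Rabiner and Sambur. This is a minimal illustrative sketch, not the chapter's RBF-NN VAD; the frame sizes and thresholds are assumed values chosen for the toy signal.

```python
import numpy as np

def frame_signal(x, frame_len=256, hop=128):
    # Split a 1-D signal into overlapping frames (no padding).
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def zcr_energy_vad(x, energy_thresh=0.01, zcr_thresh=0.25):
    # Classic heuristic: a frame is flagged as speech when its
    # short-term energy is high and its zero-crossing rate is low
    # (characteristic of voiced speech); both thresholds are ad hoc.
    frames = frame_signal(x)
    energy = np.mean(frames ** 2, axis=1)
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    return (energy > energy_thresh) & (zcr < zcr_thresh)

# Toy check: one second of silence followed by a 200 Hz tone at 8 kHz.
fs = 8000
t = np.arange(fs) / fs
sig = np.concatenate([np.zeros(fs), 0.5 * np.sin(2 * np.pi * 200 * t)])
flags = zcr_energy_vad(sig)
```

As the chapter argues, this heuristic degrades quickly in noise, which is the motivation for replacing it with a trained RBF-NN classifier.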
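For module (iii), the standard multi-stream formulation found in the AVSR literature combines per-class log-likelihoods from the audio and visual streams with a reliability weight. The sketch below shows that generic weighted fusion rule; the weight value and class scores are assumed toy numbers, not the chapter's learned stream weights.

```python
import numpy as np

def fuse_stream_scores(log_audio, log_visual, lam=0.7):
    # Multi-stream fusion: weight the audio and visual per-class
    # log-likelihoods by an audio-reliability exponent lam in [0, 1]
    # (equivalent to P_a^lam * P_v^(1-lam)), then pick the best class.
    combined = lam * np.asarray(log_audio) + (1.0 - lam) * np.asarray(log_visual)
    return int(np.argmax(combined)), combined

# Toy scores: audio weakly prefers class 0, visual strongly prefers class 1.
# With clean audio (high lam) the audio stream dominates; with noisy
# audio (low lam) the visual stream overturns the decision.
pred_clean, _ = fuse_stream_scores([-1.0, -2.0], [-5.0, -0.5], lam=0.9)
pred_noisy, _ = fuse_stream_scores([-1.0, -2.0], [-5.0, -0.5], lam=0.3)
```

Adapting the weight to the acoustic conditions is precisely what makes the bimodal back-end outperform either unimodal recognizer.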
© 2012 IFIP
Chin, S.W., Seng, K.P., Ang, LM. (2012). Audio-Visual Speech Processing for Human Computer Interaction. In: Gulrez, T., Hassanien, A.E. (eds) Advances in Robotics and Virtual Reality. Intelligent Systems Reference Library, vol 26. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23363-0_6
Print ISBN: 978-3-642-23362-3
Online ISBN: 978-3-642-23363-0