Abstract
This chapter presents an audio-visual speech recognition (AVSR) system for human-computer interaction (HCI) that comprises three main modules: (i) radial basis function neural network (RBF-NN) voice activity detection (VAD), (ii) watershed lips detection with H∞ lips tracking, and (iii) multi-stream audio-visual back-end processing. The role of AVSR in the HCI pipeline and the background of each module are discussed first, followed by the design details of the overall proposed system. Unlike conventional lips detection, which requires prior skin/non-skin detection and face localization, the proposed watershed lips detection aided by H∞ lips tracking locates the lips directly, rendering those preliminary steps unnecessary and potentially saving time. In addition, the proposed RBF-NN VAD offers better noise compensation and more precise speech localization than the conventional zero-crossing rate and short-term signal energy methods, yielding higher recognition performance through the audio modality. Finally, the developed AVSR system, which integrates the audio and visual information together with the temporal synchrony of the audio-visual data streams, achieves a significant improvement over unimodal speech recognition as well as over decision- and feature-level integration approaches.
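To ground the comparison made for module (i), the conventional baseline the chapter improves on can be sketched as a zero-crossing-rate plus short-term-energy endpoint detector in the style of Rabiner and Sambur. This is a minimal illustrative sketch, not the chapter's RBF-NN VAD; the frame sizes and thresholds are assumed values chosen for the toy signal.

```python
import numpy as np

def frame_signal(x, frame_len=256, hop=128):
    # Split a 1-D signal into overlapping frames (no padding).
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def zcr_energy_vad(x, energy_thresh=0.01, zcr_thresh=0.25):
    # Classic heuristic: a frame is flagged as speech when its
    # short-term energy is high and its zero-crossing rate is low
    # (characteristic of voiced speech); both thresholds are ad hoc.
    frames = frame_signal(x)
    energy = np.mean(frames ** 2, axis=1)
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    return (energy > energy_thresh) & (zcr < zcr_thresh)

# Toy check: one second of silence followed by a 200 Hz tone at 8 kHz.
fs = 8000
t = np.arange(fs) / fs
sig = np.concatenate([np.zeros(fs), 0.5 * np.sin(2 * np.pi * 200 * t)])
flags = zcr_energy_vad(sig)
```

As the chapter argues, this heuristic degrades quickly in noise, which is the motivation for replacing it with a trained RBF-NN classifier.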
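For module (iii), the standard multi-stream formulation found in the AVSR literature combines per-class log-likelihoods from the audio and visual streams with a reliability weight. The sketch below shows that generic weighted fusion rule; the weight value and class scores are assumed toy numbers, not the chapter's learned stream weights.

```python
import numpy as np

def fuse_stream_scores(log_audio, log_visual, lam=0.7):
    # Multi-stream fusion: weight the audio and visual per-class
    # log-likelihoods by an audio-reliability exponent lam in [0, 1]
    # (equivalent to P_a^lam * P_v^(1-lam)), then pick the best class.
    combined = lam * np.asarray(log_audio) + (1.0 - lam) * np.asarray(log_visual)
    return int(np.argmax(combined)), combined

# Toy scores: audio weakly prefers class 0, visual strongly prefers class 1.
# With clean audio (high lam) the audio stream dominates; with noisy
# audio (low lam) the visual stream overturns the decision.
pred_clean, _ = fuse_stream_scores([-1.0, -2.0], [-5.0, -0.5], lam=0.9)
pred_noisy, _ = fuse_stream_scores([-1.0, -2.0], [-5.0, -0.5], lam=0.3)
```

Adapting the weight to the acoustic conditions is precisely what makes the bimodal back-end outperform either unimodal recognizer.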
© 2012 IFIP
Chin, S.W., Seng, K.P., Ang, LM. (2012). Audio-Visual Speech Processing for Human Computer Interaction. In: Gulrez, T., Hassanien, A.E. (eds) Advances in Robotics and Virtual Reality. Intelligent Systems Reference Library, vol 26. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23363-0_6
Print ISBN: 978-3-642-23362-3
Online ISBN: 978-3-642-23363-0