Abstract
This contribution describes experiments in audio-visual isolated word recognition. The results of these experiments will be used to improve our voice dialogue system, to which visual speech recognition will be added. Voice dialogue systems can be used in train or bus stations (or elsewhere) where noise levels are relatively high, so the visual component of speech can improve the recognition rate, mainly in noisy conditions. The audio-visual recognition of isolated words in our experiments was based on the two-stream Hidden Markov Model (HMM) technique and on HMMs of single Czech phonemes and visemes. Different visual speech features and different numbers of HMM states and mixtures were evaluated in individual tests. In the subsequent experiments, isolated words were recognized after the HMMs were trained, and babble noise was added to the acoustic speech signal in successive steps.
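The abstract's two-stream approach can be illustrated with a minimal sketch of the standard multi-stream decision rule: per-word log-likelihoods from the audio and visual streams are combined with stream exponent weights that sum to one, and the word model with the highest combined score is chosen. The word list, scores, and weight values below are hypothetical illustrations, not figures from the paper.

```python
def combine_streams(audio_ll, visual_ll, lambda_a=0.7):
    """Weighted combination of per-word audio/visual log-likelihoods.

    lambda_a is the audio stream weight; the visual weight is 1 - lambda_a.
    """
    lambda_v = 1.0 - lambda_a
    return {w: lambda_a * audio_ll[w] + lambda_v * visual_ll[w]
            for w in audio_ll}

def recognize(audio_ll, visual_ll, lambda_a=0.7):
    """Return the word whose combined two-stream score is highest."""
    scores = combine_streams(audio_ll, visual_ll, lambda_a)
    return max(scores, key=scores.get)

# Hypothetical scores: in babble noise the audio stream becomes
# unreliable, so lowering lambda_a lets the visual stream tip the
# decision toward the word the lips actually articulated.
audio_ll = {"ano": -110.0, "ne": -108.0}   # noisy audio slightly favors "ne"
visual_ll = {"ano": -40.0, "ne": -55.0}    # lipreading clearly favors "ano"

print(recognize(audio_ll, visual_ll, lambda_a=0.3))  # visual-weighted: "ano"
```

In practice the stream weights are tuned to the acoustic SNR, which is why the visual stream helps mainly in noisy conditions, as the abstract notes.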
© 2011 Springer-Verlag Berlin Heidelberg
Cite this paper
Chaloupka, J. (2011). Audio-Visual Isolated Words Recognition for Voice Dialogue System. In: Esposito, A., Vinciarelli, A., Vicsi, K., Pelachaud, C., Nijholt, A. (eds) Analysis of Verbal and Nonverbal Communication and Enactment. The Processing Issues. Lecture Notes in Computer Science, vol 6800. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25775-9_8
DOI: https://doi.org/10.1007/978-3-642-25775-9_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-25774-2
Online ISBN: 978-3-642-25775-9