Abstract
Automatic speech recognition (ASR) systems benefit from the visual modality to improve their performance, especially in noisy environments. Combining acoustic features with visual features yields an audiovisual speech recognition (AVSR) system. This paper reviews existing and recent AVSR techniques. Special emphasis is placed on recent AVSR fusion techniques: the fusion stages (early, intermediate, and late integration) are discussed together with their corresponding models. The aim of this study is to discuss the different AVSR approaches and to compare existing AVSR techniques.
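As a rough illustration of two of the fusion stages named in the abstract, the sketch below contrasts early (feature-level) integration, where acoustic and visual feature vectors are concatenated before classification, with late (decision-level) integration, where each modality is scored separately and the scores are combined. It is a minimal sketch rather than the method of any surveyed system: the function names, the toy feature values, the class labels, and the `audio_weight` parameter are all assumptions made for illustration. Intermediate integration is not shown, since it requires a joint model of the two streams.

```python
import math


def early_fusion(audio_feat, visual_feat):
    """Early integration: concatenate the per-frame feature vectors."""
    return list(audio_feat) + list(visual_feat)


def late_fusion(audio_logp, visual_logp, audio_weight=0.7):
    """Late integration: weighted sum of per-class log-likelihoods.

    audio_weight is a hypothetical stream weight in [0, 1]; a lower value
    trusts the visual stream more, e.g. when the audio is noisy.
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    return {
        label: audio_weight * audio_logp[label]
        + (1.0 - audio_weight) * visual_logp[label]
        for label in audio_logp
    }


def decide(scores):
    """Pick the class with the highest combined score."""
    return max(scores, key=scores.get)


# Toy inputs (assumed values, not taken from any dataset).
audio_feat = [0.2, -1.3, 0.7]   # e.g. a few acoustic coefficients
visual_feat = [0.5, 0.1]        # e.g. lip width and height
fused_feat = early_fusion(audio_feat, visual_feat)

# Per-class log-likelihoods from two separate (hypothetical) classifiers.
audio_logp = {"ba": math.log(0.6), "ga": math.log(0.4)}
visual_logp = {"ba": math.log(0.1), "ga": math.log(0.9)}

audio_only = decide(late_fusion(audio_logp, visual_logp, audio_weight=1.0))
combined = decide(late_fusion(audio_logp, visual_logp, audio_weight=0.7))
```

In this toy case the audio stream alone prefers one class, while the weighted combination prefers the other because the visual stream is far more confident. In practice the stream weight would be tuned, for example as a function of the estimated acoustic noise level.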
Copyright information
© 2020 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Addarrazi, I., Satori, H., Satori, K. (2020). A Follow-Up Survey of Audiovisual Speech Integration Strategies. In: Bhateja, V., Satapathy, S., Satori, H. (eds) Embedded Systems and Artificial Intelligence. Advances in Intelligent Systems and Computing, vol 1076. Springer, Singapore. https://doi.org/10.1007/978-981-15-0947-6_60
DOI: https://doi.org/10.1007/978-981-15-0947-6_60
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-0946-9
Online ISBN: 978-981-15-0947-6
eBook Packages: Intelligent Technologies and Robotics (R0)