A Follow-Up Survey of Audiovisual Speech Integration Strategies

  • Ilham Addarrazi
  • Hassan Satori
  • Khalid Satori
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 1076)


Automatic speech recognition (ASR) systems benefit from the visual modality, which improves their performance, especially in noisy environments. Combining acoustic features with visual features yields an audiovisual speech recognition (AVSR) system. This paper reviews existing and recent AVSR techniques. Special emphasis is placed on recent AVSR fusion techniques: the fusion stages (early, intermediate and late integration) are discussed together with their corresponding models. The aim of this study is to discuss the different AVSR approaches and to compare existing AVSR techniques.


Audiovisual speech recognition, Automatic speech recognition, Lip reading, Late integration, Early integration, Features extraction
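
The difference between two of the fusion stages named in the abstract can be illustrated with a minimal sketch. It is not taken from the paper: the function names, the frame-aligned toy inputs, and the single stream weight `audio_weight` are assumptions made for illustration. Early integration is shown as per-frame concatenation of audio and visual feature vectors, and late integration as a weighted sum of per-class log scores from two separate recognizers. Intermediate integration is not sketched because it requires a joint model of the two streams.

```python
import math
from typing import List, Sequence


def early_fusion(audio_feats: Sequence[Sequence[float]],
                 visual_feats: Sequence[Sequence[float]]) -> List[List[float]]:
    """Early (feature-level) integration: join the two streams frame by frame.

    Assumes both streams were already resampled to the same frame rate.
    """
    if len(audio_feats) != len(visual_feats):
        raise ValueError("audio and visual streams must have the same number of frames")
    # One combined vector per frame, to be fed to a single recognizer.
    return [list(a) + list(v) for a, v in zip(audio_feats, visual_feats)]


def late_fusion(audio_log_scores: Sequence[float],
                visual_log_scores: Sequence[float],
                audio_weight: float) -> int:
    """Late (decision-level) integration: combine per-class log scores.

    audio_weight is a hypothetical stream weight in [0, 1]; a lower value
    trusts the visual recognizer more, for example when the audio is noisy.
    Returns the index of the class with the highest combined score.
    """
    if len(audio_log_scores) != len(visual_log_scores):
        raise ValueError("both recognizers must score the same set of classes")
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    combined = [audio_weight * a + (1.0 - audio_weight) * v
                for a, v in zip(audio_log_scores, visual_log_scores)]
    return max(range(len(combined)), key=combined.__getitem__)


# Toy example: two classes, the audio recognizer prefers class 0,
# the visual recognizer prefers class 1.
audio_scores = [math.log(0.6), math.log(0.4)]
visual_scores = [math.log(0.2), math.log(0.8)]

clean_decision = late_fusion(audio_scores, visual_scores, audio_weight=0.9)
noisy_decision = late_fusion(audio_scores, visual_scores, audio_weight=0.2)
fused_frames = early_fusion([[1.0, 2.0]], [[3.0]])
```

In this toy setting the decision follows the audio recognizer when `audio_weight` is high and the visual recognizer when it is low. In a real system the weight would typically be tuned or estimated from the acoustic noise level.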



Copyright information

© Springer Nature Singapore Pte Ltd. 2020

Authors and Affiliations

  • Ilham Addarrazi (1)
  • Hassan Satori (1)
  • Khalid Satori (1)
  1. Department of Mathematics and Computer Science, FSDM, University of Sidi Mohamed Ben Abdellah, Fez, Morocco
