
A Follow-Up Survey of Audiovisual Speech Integration Strategies

  • Conference paper
  • First Online:
Embedded Systems and Artificial Intelligence

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1076)

Abstract

Automatic speech recognition (ASR) systems benefit from the visual modality, which improves their performance, especially in noisy environments. By combining acoustic features with visual features, an audiovisual speech recognition (AVSR) system can be built. This paper presents a review of existing and recent AVSR techniques, with special emphasis on fusion: the fusion stages of AVSR systems (early, intermediate, and late integration) are discussed together with their corresponding models. The aim of this study is to examine the different AVSR approaches and compare the existing techniques.
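
As a rough illustration of the fusion stages surveyed here, the sketch below contrasts early (feature-level) integration, where acoustic and visual features are concatenated before classification, with late (decision-level) integration, where per-modality scores are combined. It is a minimal sketch, not the authors' implementation; the feature dimensions, class count, and fusion weight are illustrative assumptions.

```python
# Illustrative sketch (not from the paper): early vs. late audiovisual fusion,
# using random stand-ins for acoustic (e.g. MFCC) and visual (e.g. lip-region)
# feature streams.
import numpy as np

rng = np.random.default_rng(0)

T = 100          # number of synchronized frames (assumed)
D_AUDIO = 13     # e.g. 13 MFCC coefficients per frame
D_VISUAL = 20    # e.g. 20 lip-shape/appearance coefficients per frame

audio_feats = rng.normal(size=(T, D_AUDIO))    # placeholder acoustic features
visual_feats = rng.normal(size=(T, D_VISUAL))  # placeholder visual features

# --- Early (feature-level) integration --------------------------------------
# Concatenate the two streams frame by frame; a single classifier
# (HMM, SVM, DNN, ...) is then trained on the joint vectors.
early_fused = np.concatenate([audio_feats, visual_feats], axis=1)
print("early-fused feature shape:", early_fused.shape)  # (T, D_AUDIO + D_VISUAL)

# --- Late (decision-level) integration --------------------------------------
# Each modality is decoded separately; per-class scores are then combined,
# here with a weighted sum whose weight can reflect the estimated reliability
# of the audio channel (e.g. a lower weight under acoustic noise).
def combine_scores(audio_scores, visual_scores, audio_weight=0.7):
    """Weighted linear combination of per-class scores from the two modalities."""
    return audio_weight * audio_scores + (1.0 - audio_weight) * visual_scores

n_classes = 10                                   # e.g. ten digit classes (assumed)
audio_scores = rng.normal(size=n_classes)        # stand-in audio-only scores
visual_scores = rng.normal(size=n_classes)       # stand-in visual-only scores

fused_scores = combine_scores(audio_scores, visual_scores, audio_weight=0.6)
print("late-fusion decision:", int(np.argmax(fused_scores)))
```

Intermediate integration, also discussed in the survey, sits between these two extremes: the modalities keep separate model streams but share state or coupling during decoding (e.g. coupled HMMs), rather than being merged at the feature or score level.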





Copyright information

© 2020 Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Addarrazi, I., Satori, H., Satori, K. (2020). A Follow-Up Survey of Audiovisual Speech Integration Strategies. In: Bhateja, V., Satapathy, S., Satori, H. (eds) Embedded Systems and Artificial Intelligence. Advances in Intelligent Systems and Computing, vol 1076. Springer, Singapore. https://doi.org/10.1007/978-981-15-0947-6_60
