
Multimodal speech recognition: increasing accuracy using high speed video data

  • Original Paper
  • Published in: Journal on Multimodal User Interfaces

Abstract

To date, multimodal speech recognition systems that process both audio and video signals show significantly better results than their unimodal counterparts. In general, researchers divide the audio–visual speech recognition problem into two parts: first, extracting the most informative features from each modality, and second, finding the most effective way to fuse the two modalities. Ultimately, this leads to an improvement in speech recognition accuracy. Almost all modern studies follow this approach with video data recorded at the standard rate of 25 frames per second. The choice of this recording speed is easily explained, since the vast majority of existing audio–visual databases were recorded at this rate. However, it should be noted that 25 frames per second is a worldwide standard for many areas and was never specifically derived for speech recognition tasks. The main purpose of this study is to investigate the effect of high-speed video data (up to 200 frames per second) on speech recognition accuracy, and to find out whether the use of a high-speed video camera makes speech recognition systems more robust to acoustic noise. To this end, we recorded a database of audio–visual Russian speech with high-speed video recordings, consisting of 20 speakers, each pronouncing 200 phrases of continuous Russian speech. Experiments performed on this database showed an improvement in absolute speech recognition accuracy of up to 3.10%. We also show that using the high-speed camera at 200 fps yields better recognition results under different acoustically noisy conditions (signal-to-noise ratio varied between 40 and 0 dB) with different types of noise (e.g. white noise, babble noise).
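
The noisy test conditions summarized above (white or babble noise added to the audio channel at SNRs between 40 and 0 dB) can be reproduced by mixing a noise signal into clean speech at a controlled power ratio. The following is only a minimal sketch of that standard procedure, not the authors' code: the function name, the synthetic 16 kHz signals, and the particular SNR steps are assumptions made for illustration.

```python
import numpy as np

def mix_noise_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add `noise` to `speech`, scaled so the mixture has the requested SNR in dB."""
    # Repeat the noise if it is shorter than the speech, then truncate to the same length.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]

    # SNR_dB = 10 * log10(P_speech / P_scaled_noise); solve for the noise gain.
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise

# Illustrative use: degrade a synthetic one-second 16 kHz signal at several SNR levels.
# The exact SNR steps used in the paper are not reproduced here.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)   # placeholder for a real clean utterance
white = rng.standard_normal(16000)   # white-noise source
noisy_versions = {snr: mix_noise_at_snr(clean, white, snr) for snr in (40, 30, 20, 10, 0)}
```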

Acknowledgements

This research is supported by the Ministry of Education and Science of the Russian Federation (Project No. 8.9957.2017/5.2), by the Government of Russia (Grant No. 08-08), by the Russian Foundation for Basic Research (Project Nos. 18-37-00306, 16-37-60100), by the Council for Grants of the President of the Russian Federation (Project Nos. MD-254.2017.8, MK-1000.2017.8), by the Russian state research (No. 0073-2018-0002), by the Ministry of Education of the Czech Republic (Project No. LTARF18017).

Author information

Corresponding author

Correspondence to Denis Ivanko.

About this article

Cite this article

Ivanko, D., Karpov, A., Fedotov, D. et al. Multimodal speech recognition: increasing accuracy using high speed video data. J Multimodal User Interfaces 12, 319–328 (2018). https://doi.org/10.1007/s12193-018-0267-1
