Abstract
Automatic lip-reading (ALR) is a challenging task, and a significant amount of research has been devoted to it in recent years. However, continuous Russian speech recognition remains under-investigated. In this paper, we present the results of a Russian visual speech recognition (VSR) system using pixel-based and advanced geometry-based features. The HAVRUS video database, comprising 4000 utterances of continuous Russian speech collected from 20 speakers, is used in this study. Pixel-based features (based on principal component analysis, PCA) and geometry-based features (based on active appearance models, AAM) were used for feature representation, and Gaussian mixture hidden Markov models (HMMs) were used for classification. Our evaluation experiments show a significant improvement (up to 9%) in recognition accuracy when using the proposed geometry-based features compared to pixel-based PCA features. In future studies, we plan to use the combined VSR system to augment the performance of audio-based automatic speech recognition systems in human–robot interfaces.
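As a rough illustration of the pixel-based branch mentioned in the abstract, the sketch below projects flattened mouth-region frames onto their leading principal components. It is not the authors' implementation: the function names, the power-iteration solver, and the toy 4-pixel frames are assumptions chosen to keep the example dependency-free. A practical system would more likely use an optimized linear-algebra library and pass the resulting feature vectors to an HMM toolkit.

```python
import math
import random


def center(frames):
    """Subtract the per-pixel mean from every flattened frame."""
    n, d = len(frames), len(frames[0])
    mean = [sum(f[j] for f in frames) / n for j in range(d)]
    return [[f[j] - mean[j] for j in range(d)] for f in frames]


def top_component(rows, iters=200, seed=0):
    """Leading principal direction of centered rows via power iteration."""
    rng = random.Random(seed)
    d = len(rows[0])
    v = [rng.uniform(-1.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iters):
        # Apply X^T X to v without forming the d x d covariance matrix.
        proj = [sum(r[j] * v[j] for j in range(d)) for r in rows]
        w = [sum(p * r[j] for p, r in zip(proj, rows)) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # no variance left in the data
        v = [x / norm for x in w]
    return v


def pca_features(frames, k):
    """Project flattened frames onto the first k principal directions."""
    centered = center(frames)
    residual = [row[:] for row in centered]
    components = []
    for i in range(k):
        v = top_component(residual, seed=i)
        components.append(v)
        # Deflate: remove the part of each row already explained by v.
        for row in residual:
            p = sum(a * b for a, b in zip(row, v))
            for j in range(len(row)):
                row[j] -= p * v[j]
    feats = [[sum(a * b for a, b in zip(row, v)) for v in components]
             for row in centered]
    return feats, components


# Hypothetical toy data: ten "frames" of four pixels each.
toy_frames = [[float(t), float(t), 0.1 * (t % 2), -0.1 * (t % 2)]
              for t in range(10)]
toy_feats, toy_components = pca_features(toy_frames, k=2)
```

Each row of `toy_feats` is a low-dimensional feature vector for one frame. In a recognizer of the kind the abstract describes, sequences of such vectors would be the observations modelled by the Gaussian mixture HMMs.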
Acknowledgements
This research is supported by the Russian Foundation for Basic Research (project No. 18-37-00306 and project No. 18-07-01216) and by the Government of the Russian Federation (Grant 08-08).
Copyright information
© 2020 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Ivanko, D., Ryumin, D., Kipyatkova, I., Axyonov, A., Karpov, A. (2020). Lip-Reading Using Pixel-Based and Geometry-Based Features for Multimodal Human–Robot Interfaces. In: Ronzhin, A., Shishlakov, V. (eds) Proceedings of 14th International Conference on Electromechanics and Robotics “Zavalishin's Readings”. Smart Innovation, Systems and Technologies, vol 154. Springer, Singapore. https://doi.org/10.1007/978-981-13-9267-2_39
DOI: https://doi.org/10.1007/978-981-13-9267-2_39
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-9266-5
Online ISBN: 978-981-13-9267-2
eBook Packages: Intelligent Technologies and Robotics (R0)