Visual Speech Recognition Using PCA Networks and LSTMs in a Tandem GMM-HMM System

Zimmermann, Marina; Mehdipour Ghazi, Mostafa; Ekenel, Hazım Kemal; Thiran, Jean-Philippe

doi:10.1007/978-3-319-54427-4_20

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 10117)

Included in the following conference series:

Asian Conference on Computer Vision

Abstract

Automatic visual speech recognition is an interesting problem in pattern recognition, especially when audio data is noisy or not readily available. It is also a very challenging task, mainly because visual articulations carry less information than the audible utterance. In this work, principal component analysis is applied to image patches extracted from the video data to learn the weights of a two-stage convolutional network. Block histograms are then extracted as the unsupervised learning features. These features are used to train a recurrent neural network with a set of long short-term memory cells to obtain spatiotemporal features. Finally, the obtained features are used in a tandem GMM-HMM system for speech recognition. Our results show that the proposed method outperforms the baseline techniques applied to the OuluVS2 audiovisual database for phrase recognition: with the frontal view, sentence correctness reaches 79% on cross-validation and 73% on testing, compared to the baseline of 74% on cross-validation.
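The unsupervised feature step described in the abstract (PCA-learned filters followed by block histograms) can be illustrated with a minimal NumPy sketch. It is a simplified, single-stage illustration and not the authors' implementation: the function names, patch size, filter count, block size, the sign-based binarization and the random input frames are all assumptions made for the example. A second stage would repeat the filter learning on the output maps of the first.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def extract_patches(img, k):
    """Return every k x k patch of a 2-D image as a row, with the patch mean removed."""
    patches = sliding_window_view(img, (k, k)).reshape(-1, k * k).astype(np.float64)
    return patches - patches.mean(axis=1, keepdims=True)


def learn_pca_filters(images, k, n_filters):
    """Learn filters as the leading principal directions of all patches (hypothetical helper)."""
    X = np.vstack([extract_patches(img, k) for img in images])
    cov = X.T @ X / X.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)             # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_filters]      # indices of the largest ones
    return eigvecs[:, order].T.reshape(n_filters, k, k)


def apply_filters(img, filters):
    """Correlate the image with each filter ('valid' region only)."""
    k = filters.shape[-1]
    windows = sliding_window_view(img, (k, k))         # (H-k+1, W-k+1, k, k)
    return np.einsum("ijkl,fkl->fij", windows, filters)


def block_histograms(maps, block):
    """Binarize the maps by sign, pack them into integer codes, histogram each block."""
    n = maps.shape[0]
    weights = 2 ** np.arange(n)
    codes = np.tensordot(weights, (maps > 0).astype(np.int64), axes=1)
    H, W = codes.shape
    feats = []
    for r in range(0, H - block + 1, block):           # non-overlapping blocks (assumption)
        for c in range(0, W - block + 1, block):
            tile = codes[r:r + block, c:c + block].ravel()
            feats.append(np.bincount(tile, minlength=2 ** n))
    return np.concatenate(feats)


# Synthetic stand-in for mouth-region frames; real input would come from video.
rng = np.random.default_rng(0)
frames = [rng.standard_normal((16, 16)) for _ in range(8)]
filters = learn_pca_filters(frames, k=5, n_filters=4)
features = block_histograms(apply_filters(frames[0], filters), block=6)
print(features.shape)
```

In a pipeline like the one the abstract outlines, one such feature vector per frame would form the input sequence to the LSTM; that part is not sketched here.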

Notes

1. The baseline results can be found at http://ouluvs2.cse.oulu.fi/preliminary.html.

Acknowledgement

This work was supported by TUBITAK project number 113E067 and by a Marie Curie FP7 Integration Grant within the 7th EU Framework Programme.

Author information

Authors and Affiliations

Signal Processing Laboratory (LTS5), Ecole Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
Marina Zimmermann & Jean-Philippe Thiran
Faculty of Engineering and Natural Sciences, Sabanci University, Istanbul, Turkey
Mostafa Mehdipour Ghazi
Department of Computer Engineering, Istanbul Technical University (ITU), Istanbul, Turkey
Hazım Kemal Ekenel

Corresponding author

Correspondence to Marina Zimmermann.

Editor information

Editors and Affiliations

Institute of Information Science, Academia Sinica, Taipei, Taiwan
Chu-Song Chen
Tsinghua University, Beijing, China
Jiwen Lu
School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore, Singapore
Kai-Kuang Ma

About this paper

Cite this paper

Zimmermann, M., Mehdipour Ghazi, M., Ekenel, H.K., Thiran, JP. (2017). Visual Speech Recognition Using PCA Networks and LSTMs in a Tandem GMM-HMM System. In: Chen, CS., Lu, J., Ma, KK. (eds) Computer Vision – ACCV 2016 Workshops. ACCV 2016. Lecture Notes in Computer Science, vol 10117. Springer, Cham. https://doi.org/10.1007/978-3-319-54427-4_20

DOI: https://doi.org/10.1007/978-3-319-54427-4_20
Published: 16 March 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-54426-7
Online ISBN: 978-3-319-54427-4
eBook Packages: Computer Science, Computer Science (R0)
