Abstract
The use of visual information as part of audio speech processing has attracted significant recent interest. This paper presents a data-driven approach that estimates audio speech acoustics from temporal visual information alone, without recourse to linguistic features such as phonemes and visemes. Audio (log filterbank) and visual (2D-DCT) features are extracted, and various multilayer perceptron (MLP) configurations and dataset arrangements are evaluated to identify the optimal setup, showing that, given a sequence of prior visual frames, a reasonably accurate estimate of the corresponding audio frame can be produced.
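As a rough illustration of the mapping the abstract describes, the sketch below trains a small MLP to predict one audio feature frame from a window of prior visual feature frames. The feature dimensions, context window length, placeholder random data, and the use of scikit-learn's MLPRegressor are all illustrative assumptions, not the authors' exact setup; in the paper the visual features are 2D-DCT coefficients of the mouth region and the audio features are log filterbank frames.

```python
# Minimal sketch of a visual-to-audio frame mapping (assumptions noted above).
import numpy as np
from sklearn.neural_network import MLPRegressor

N_DCT = 50     # 2D-DCT coefficients kept per visual frame (assumed)
N_FBANK = 23   # log filterbank channels per audio frame (assumed)
CONTEXT = 5    # number of prior visual frames fed to the network (assumed)

# Placeholder features; in practice these would come from the lip region
# (2D-DCT) and from mel filterbank analysis of the parallel audio track.
rng = np.random.default_rng(0)
n_frames = 2000
visual = rng.standard_normal((n_frames, N_DCT))
audio = rng.standard_normal((n_frames, N_FBANK))

# Build (input, target) pairs: a window of CONTEXT visual frames
# predicts the audio frame that immediately follows the window.
X = np.stack([visual[t - CONTEXT:t].ravel() for t in range(CONTEXT, n_frames)])
y = audio[CONTEXT:]

mlp = MLPRegressor(hidden_layer_sizes=(256,), max_iter=200, random_state=0)
mlp.fit(X[:1500], y[:1500])

# Mean squared error on held-out frames gives a rough accuracy measure.
pred = mlp.predict(X[1500:])
print(f"held-out MSE: {np.mean((pred - y[1500:]) ** 2):.3f}")
```

With real corpus data, held-out error of this kind is one simple way to compare MLP configurations and context lengths, in the spirit of the evaluation the abstract outlines.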
Acknowledgements
This work was supported by the UK Engineering and Physical Sciences Research Council (EPSRC) Grant No. EP/M026981/1 (CogAVHearing, http://cogavhearing.cs.stir.ac.uk). In accordance with EPSRC policy, all experimental data used in the project simulations is available at http://hdl.handle.net/11667/81. The authors would also like to gratefully acknowledge Prof. Leslie Smith and Dr Ahsan Adeel at the University of Stirling, Dr Kristína Malinovská at Comenius University in Bratislava, and the anonymous reviewers for their helpful comments and suggestions.
Copyright information
© 2016 Springer International Publishing AG
Cite this paper
Abel, A. et al. (2016). A Data Driven Approach to Audiovisual Speech Mapping. In: Liu, C.L., Hussain, A., Luo, B., Tan, K., Zeng, Y., Zhang, Z. (eds) Advances in Brain Inspired Cognitive Systems. BICS 2016. Lecture Notes in Computer Science, vol 10023. Springer, Cham. https://doi.org/10.1007/978-3-319-49685-6_30
DOI: https://doi.org/10.1007/978-3-319-49685-6_30
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-49684-9
Online ISBN: 978-3-319-49685-6
eBook Packages: Computer Science (R0)