A Data Driven Approach to Audiovisual Speech Mapping

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10023)

Abstract

The use of visual information as part of audio speech processing has attracted significant recent interest. This paper presents a data-driven approach that estimates audio speech acoustics from temporal visual information alone, without recourse to linguistic features such as phonemes and visemes. Audio (log filterbank) and visual (2D-DCT) features are extracted, and various MLP configurations and datasets are evaluated to identify optimal results, showing that, given a sequence of prior visual frames, a reasonably accurate estimate of the corresponding audio frame can be produced.
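The pipeline the abstract describes can be sketched end to end: 2D-DCT coefficients from mouth-region frames on the visual side, log filterbank energies on the audio side, and an MLP mapping a window of prior visual frames to the current audio frame. The sketch below uses synthetic stand-in data and illustrative sizes (frame resolution, number of DCT coefficients, filterbank size, context window, MLP configuration); none of these are the paper's actual settings.

```python
# Minimal sketch of visual-to-audio mapping: 2D-DCT visual features,
# log filterbank audio features, and an MLP regressor from a window of
# prior visual frames to the current audio frame. All sizes and the
# synthetic data are illustrative assumptions, not the paper's settings.
import numpy as np
from scipy.fftpack import dct
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

def visual_features(frame, k=5):
    """2D-DCT of a mouth-region frame, keeping the top-left k x k coefficients."""
    c = dct(dct(frame, axis=0, norm='ortho'), axis=1, norm='ortho')
    return c[:k, :k].ravel()

def log_filterbank(power_spectrum, fbank):
    """Log filterbank energies from a power spectrum and a filterbank matrix."""
    return np.log(fbank @ power_spectrum + 1e-8)

# Synthetic stand-in data: 200 video frames (32x32 mouth crops) and matching
# audio power spectra (129 bins), driven by a shared latent signal so that
# the visual-to-audio mapping is actually learnable.
T, ctx = 200, 3                                  # frames, visual context window
latent = rng.normal(size=(T, 4))
frames = (latent @ rng.normal(size=(4, 32 * 32))).reshape(T, 32, 32)
spectra = np.abs(latent @ rng.normal(size=(4, 129))) + 0.1
fbank = np.abs(rng.normal(size=(20, 129)))       # crude 20-filter bank

V = np.stack([visual_features(f) for f in frames])         # (T, 25)
A = np.stack([log_filterbank(s, fbank) for s in spectra])  # (T, 20)

# Each training example: ctx prior visual frames -> current audio frame.
X = np.stack([V[t - ctx:t].ravel() for t in range(ctx, T)])
y = A[ctx:]

mlp = MLPRegressor(hidden_layer_sizes=(64,), max_iter=2000, random_state=0)
mlp.fit(X[:150], y[:150])
mse = np.mean((mlp.predict(X[150:]) - y[150:]) ** 2)
print("held-out MSE:", mse)
```

Framing the task as frame-level regression, rather than classifying visemes first, is what makes the approach "data driven": the network learns the visual-to-acoustic correspondence directly, and the context window supplies the temporal information that a single frame lacks.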



Acknowledgements

This work was supported by the UK Engineering and Physical Sciences Research Council (EPSRC) Grant No. EP/M026981/1 (CogAVHearing, http://cogavhearing.cs.stir.ac.uk). In accordance with EPSRC policy, all experimental data used in the project simulations are available at http://hdl.handle.net/11667/81. The authors gratefully acknowledge Prof. Leslie Smith and Dr. Ahsan Adeel at the University of Stirling, Dr. Kristína Malinovská at Comenius University in Bratislava, and the anonymous reviewers for their helpful comments and suggestions.

Author information

Corresponding author: Amir Hussain

Copyright information

© 2016 Springer International Publishing AG

About this paper

Cite this paper

Abel, A. et al. (2016). A Data Driven Approach to Audiovisual Speech Mapping. In: Liu, CL., Hussain, A., Luo, B., Tan, K., Zeng, Y., Zhang, Z. (eds) Advances in Brain Inspired Cognitive Systems. BICS 2016. Lecture Notes in Computer Science, vol 10023. Springer, Cham. https://doi.org/10.1007/978-3-319-49685-6_30

  • DOI: https://doi.org/10.1007/978-3-319-49685-6_30

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-49684-9

  • Online ISBN: 978-3-319-49685-6

  • eBook Packages: Computer Science (R0)
