Multimodal Speech Separation

  • Conference paper
Advances in Nonlinear Speech Processing (NOLISP 2009)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5933)

Abstract

The work of Bernstein and Benoît has confirmed that speech perception benefits from the use of multiple senses, for example both the audio and visual modalities. As a consequence, looking at the speaker’s face can help a listener to hear a speech signal better in a noisy environment and to extract it from competing sources, as originally identified by Cherry, who posed the so-called “Cocktail Party” problem. To exploit this intrinsic coherence between audition and vision within a machine, the method of blind source separation (BSS) is particularly attractive.
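
To give a concrete flavour of the BSS idea in its simplest (instantaneous, two-microphone) setting, the sketch below separates two synthetic activity-modulated sources with FastICA and then uses a simulated lip-activity signal to decide which separated component belongs to the visible speaker. This is a minimal sketch under stated assumptions, not the chapter's algorithm (which addresses convolutive mixtures): it assumes NumPy and scikit-learn are available, and every signal here, including the "visual" voice activity cue, is synthetic.

```python
# Minimal BSS sketch (not the authors' method): instantaneous two-source
# mixture, FastICA separation, and a crude audio-visual step that matches
# a separated component to a simulated lip-activity signal.
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
fs, dur = 8000, 4.0                       # sample rate (Hz), duration (s)
t = np.arange(int(fs * dur)) / fs

# Two synthetic "speech-like" sources: Laplacian noise carriers with
# distinct on/off envelopes standing in for two talkers.
env1 = (np.sin(2 * np.pi * 0.5 * t) > 0).astype(float)        # visible talker
env2 = (np.sin(2 * np.pi * 0.3 * t + 1.0) > 0).astype(float)  # competing talker
s1 = env1 * rng.laplace(size=t.size)
s2 = env2 * rng.laplace(size=t.size)
S = np.c_[s1, s2]                         # (n_samples, n_sources)

# Instantaneous (non-convolutive) mixing: the simplest BSS setting.
A = np.array([[1.0, 0.6],
              [0.7, 1.0]])
X = S @ A.T                               # two "microphone" observations

# Blind separation: ICA recovers the sources up to permutation and scale.
ica = FastICA(n_components=2, random_state=0)
Y = ica.fit_transform(X)                  # (n_samples, 2), order unknown

# Simulated "visual" cue: a lip-movement-based voice activity signal for
# the visible talker (here we simply reuse env1).
visual_vad = env1

# Pick the separated component whose short-term energy correlates best
# with the visual activity of the visible speaker.
frame = 400                               # 50 ms frames at 8 kHz
n_frames = Y.shape[0] // frame
energies = (Y[:n_frames * frame] ** 2).reshape(n_frames, frame, 2).mean(axis=1)
vad_frames = visual_vad[:n_frames * frame].reshape(n_frames, frame).mean(axis=1)
corr = [np.corrcoef(energies[:, k], vad_frames)[0, 1] for k in range(2)]
target = int(np.argmax(corr))
print(f"component {target} matches the visible speaker (corr = {corr[target]:.2f})")
```

The final correlation step mirrors, in a deliberately crude way, the use of visual voice activity to resolve the ordering and permutation indeterminacies that blind separation leaves unresolved, in the spirit of [22, 25, 26].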

References

  1. Aubrey, A., Rivet, B., Hicks, Y., Girin, L., Chambers, J., Jutten, C.: Two novel visual voice activity detectors based on appearance models and retinal filtering. In: Proc. European Signal Processing Conference (EUSIPCO), Poznan, Poland, September 2007, pp. 2409–2413 (2007)

  2. Benoît, C., Mohamadi, T., Kandel, S.: Effects of phonetic context on audio-visual intelligibility of French. J. Speech and Hearing Research 37, 1195–1293 (1994)

  3. Bernstein, L.E., Auer, E.T.J., Takayanagi, S.: Auditory speech detection in noise enhanced by lipreading. Speech Communication 44(1-4), 5–18 (2004)

  4. Cherry, E.C.: Some experiments on the recognition of speech, with one and with two ears. Journal of the Acoustical Society of America 25(5), 975–979 (1953)

  5. Comon, P.: Independent component analysis, a new concept? Signal Processing 36(3), 287–314 (1994)

  6. Deligne, S., Potamianos, G., Neti, C.: Audio-Visual speech enhancement with AVCDCN (AudioVisual Codebook Dependent Cepstral Normalization). In: Proc. Int. Conf. Spoken Language Processing (ICSLP), Denver, Colorado, USA, September 2002, pp. 1449–1452 (2002)

  7. Ephraim, Y., Malah, D.: Speech Enhancement Using a Minimum Mean-Square Error Short-Time Spectral Amplitude Estimator. IEEE Transactions on Acoustics, Speech, and Signal Processing 32(6), 1109–1121 (1984)

  8. Erber, N.P.: Interaction of audition and vision in the recognition of oral speech stimuli. J. Speech and Hearing Research 12, 423–425 (1969)

  9. Gannot, S., Burshtein, D., Weinstein, E.: Iterative and sequential kalman filter-based speech enhancement algorithms. IEEE Transactions on Speech and Audio Processing 6(4), 373–385 (1998)

  10. Girin, L., Allard, A., Schwartz, J.-L.: Speech signals separation: a new approach exploiting the coherence of audio and visual speech. In: IEEE Int. Workshop on Multimedia Signal Processing (MMSP), Cannes, France (2001)

  11. Goecke, R., Potamianos, G., Neti, C.: Noisy audio feature enhancement using audio-visual speech data. In: Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP), Orlando, USA, May 2002, pp. 2025–2028 (2002)

  12. Grant, K.W., Seitz, P.-F.: The use of visible speech cues for improving auditory detection of spoken sentences. Journal of the Acoustical Society of America 108, 1197–1208 (2000)

  13. Hérault, J., Jutten, C.: Space or time adaptive signal processing by neural network models. In: Intern. Conf. on Neural Networks for Computing, Snowbird, USA, pp. 206–211 (1986)

  14. Jutten, C., Hérault, J.: Blind separation of sources. Part I: An adaptive algorithm based on a neuromimetic architecture. Signal Processing 24(1), 1–10 (1991)

  15. Jutten, C., Taleb, A.: Source separation: from dusk till dawn. In: Proc. Int. Conf. Independent Component Analysis and Blind Source Separation (ICA), Helsinki, Finland, June 2000, pp. 15–26 (2000)

  16. Kim, J., Davis, C.: Investigating the audio-visual speech detection advantage. Speech Communication 44(1-4), 19–30 (2004)

  17. McGurk, H., MacDonald, J.: Hearing lips and seeing voices. Nature 264, 746–748 (1976)

  18. Milner, B., Almajai, I.: Noisy audio speech enhancement using Wiener filters derived from visual speech. In: Proc. Int. Conf. Auditory-Visual Speech Processing (AVSP), Moreton Island, Australia (September 2007)

  19. Naqvi, S.M., Zhang, Y., Tsalaile, T., Sanei, S., Chambers, J.A.: A multimodal approach for frequency domain independent component analysis with geometrically-based initialization. In: Proc. EUSIPCO, Lausanne, Switzerland (2008)

  21. Potamianos, G., Neti, C., Deligne, S.: Joint Audio-Visual Speech Processing for Recognition and Enhancement. In: Proc. Int. Conf. Auditory-Visual Speech Processing (AVSP), St. Jorioz, France (September 2003)

  22. Rivet, B., Girin, L., Jutten, C.: Solving the indeterminations of blind source separation of convolutive speech mixtures. In: Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP), Philadelphia, USA, March 2005, pp. V-533–V-536 (2005)

  23. Rivet, B., Girin, L., Jutten, C.: Log-Rayleigh distribution: a simple and efficient statistical representation of log-spectral coefficients. IEEE Transactions on Audio, Speech and Language Processing 15(3), 796–802 (2007)

  24. Rivet, B., Girin, L., Jutten, C.: Mixing audiovisual speech processing and blind source separation for the extraction of speech signals from convolutive mixtures. IEEE Transactions on Audio, Speech and Language Processing 15(1), 96–108 (2007)

  25. Rivet, B., Girin, L., Jutten, C.: Visual voice activity detection as a help for speech source separation from convolutive mixtures. Speech Communication 49(7-8), 667–677 (2007)

  26. Rivet, B., Girin, L., Servière, C., Pham, D.-T., Jutten, C.: Using a visual voice activity detector to regularize the permutations in blind source separation of convolutive speech mixtures. In: Proc. Int. Conf. on Digital Signal Processing (DSP), Cardiff, Wales UK, July 2007, pp. 223–226 (2007)

  27. Sanei, S., Naqvi, S.M., Chambers, J.A., Hicks, Y.: A geometrically constrained multimodal approach for convolutive blind source separation. In: Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP), Honolulu, Hawaii, USA, April 2007, pp. 969–972 (2007)

  28. Sodoyer, D., Girin, L., Jutten, C., Schwartz, J.-L.: Developing an audio-visual speech source separation algorithm. Speech Communication 44(1-4), 113–125 (2004)

  29. Sodoyer, D., Girin, L., Savariaux, C., Schwartz, J.-L., Rivet, B., Jutten, C.: A study of lip movements during spontaneous dialog and its application to voice activity detection. Journal of the Acoustical Society of America 125(2), 1184–1196 (2009)

  30. Sodoyer, D., Rivet, B., Girin, L., Schwartz, J.-L., Jutten, C.: An analysis of visual speech information applied to voice activity detection. In: Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP), Toulouse, France, pp. 601–604 (2006)

  31. Sodoyer, D., Schwartz, J.-L., Girin, L., Klinkisch, J., Jutten, C.: Separation of audio-visual speech sources: a new approach exploiting the audiovisual coherence of speech stimuli. EURASIP Journal on Applied Signal Processing 2002(11), 1165–1173 (2002)

  32. Stork, D.G., Hennecke, M.E.: Speechreading by Humans and Machines. Springer, Berlin (1996)

  33. Sumby, W., Pollack, I.: Visual contribution to speech intelligibility in noise. Journal of the Acoustical Society of America 26, 212–215 (1954)

  34. Summerfield, Q.: Some preliminaries to a comprehensive account of audio-visual speech perception. In: Dodd, B., Campbell, R. (eds.) Hearing by Eye: The Psychology of Lipreading, pp. 3–51. Lawrence Erlbaum Associates, Mahwah (1987)

  35. Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. In: Proc. IEEE Conf. Comput. Vision Pattern Recognition (CVPR), Kauai, Hawaii, USA, December 2001, pp. 511–518 (2001)

  36. Wang, W., Cosker, D., Hicks, Y., Sanei, S., Chambers, J.A.: Video assisted speech source separation. In: Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP), Philadelphia, USA (March 2005)

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Rivet, B., Chambers, J. (2010). Multimodal Speech Separation. In: Solé-Casals, J., Zaiats, V. (eds) Advances in Nonlinear Speech Processing. NOLISP 2009. Lecture Notes in Computer Science (LNAI), vol. 5933. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-11509-7_1

  • DOI: https://doi.org/10.1007/978-3-642-11509-7_1

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-11508-0

  • Online ISBN: 978-3-642-11509-7
