Abstract
The work of Bernstein and of Benoît has confirmed that exploiting multiple senses, in particular both the audio and visual modalities, is advantageous in speech perception. Consequently, looking at the speaker’s face helps a listener to hear a speech signal in a noisy environment and to extract it from competing sources, as originally identified by Cherry, who posed the so-called “Cocktail Party” problem. To exploit this intrinsic coherence between audition and vision within a machine, the method of blind source separation (BSS) is particularly attractive.
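The BSS idea mentioned above can be illustrated with a minimal toy sketch. This is not the chapter's method (which addresses convolutive audio-visual mixtures); it is a from-scratch FastICA-style iteration, using only NumPy, that unmixes an instantaneous two-source mixture. All signal and variable names here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two statistically independent, non-Gaussian "sources" (stand-ins for speakers).
n = 5000
t = np.linspace(0.0, 1.0, n)
s1 = np.sign(np.sin(2 * np.pi * 13 * t))   # sub-Gaussian square wave
s2 = rng.laplace(size=n)                   # super-Gaussian noise
S = np.vstack([s1, s2])

# Instantaneous (non-convolutive) mixing: x = A s, with unknown A.
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])
X = A @ S

# Whitening: center the mixtures and decorrelate them to unit variance.
Xc = X - X.mean(axis=1, keepdims=True)
cov = Xc @ Xc.T / n
d, E = np.linalg.eigh(cov)
Z = (E @ np.diag(d ** -0.5) @ E.T) @ Xc

# Symmetric FastICA with a tanh nonlinearity.
W = rng.standard_normal((2, 2))
for _ in range(200):
    G = np.tanh(W @ Z)
    Gp = 1.0 - G ** 2                      # derivative of tanh
    W_new = (G @ Z.T) / n - np.diag(Gp.mean(axis=1)) @ W
    # Symmetric orthogonalization: W <- (W W^T)^{-1/2} W via SVD.
    u, _, vt = np.linalg.svd(W_new)
    W = u @ vt

Y = W @ Z  # estimated sources, recovered up to permutation and scale
```

The permutation and scale ambiguities visible in the last line are exactly the indeterminations that the audio-visual coherence exploited in this chapter helps to resolve for the harder convolutive case.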
References
Aubrey, A., Rivet, B., Hicks, Y., Girin, L., Chambers, J., Jutten, C.: Two novel visual voice activity detectors based on appearance models and retinal filtering. In: Proc. European Signal Processing Conference (EUSIPCO), Poznan, Poland, September 2007, pp. 2409–2413 (2007)
Benoît, C., Mohamadi, T., Kandel, S.: Effects of phonetic context on audio-visual intelligibility of French. J. Speech and Hearing Research 37, 1195–1293 (1994)
Bernstein, L.E., Auer, E.T.J., Takayanagi, S.: Auditory speech detection in noise enhanced by lipreading. Speech Communication 44(1-4), 5–18 (2004)
Cherry, E.C.: Some experiments on the recognition of speech, with one and with two ears. Journal of the Acoustical Society of America 25(5), 975–979 (1953)
Comon, P.: Independent component analysis, a new concept? Signal Processing 36(3), 287–314 (1994)
Deligne, S., Potamianos, G., Neti, C.: Audio-Visual speech enhancement with AVCDCN (AudioVisual Codebook Dependent Cepstral Normalization). In: Proc. Int. Conf. Spoken Language Processing (ICSLP), Denver, Colorado, USA, September 2002, pp. 1449–1452 (2002)
Ephraim, Y., Malah, D.: Speech Enhancement Using a Minimum Mean-Square Error Short-Time Spectral Amplitude Estimator. IEEE Transactions on Acoustics, Speech, and Signal Processing 32(6), 1109–1121 (1984)
Erber, N.P.: Interaction of audition and vision in the recognition of oral speech stimuli. J. Speech and Hearing Research 12, 423–425 (1969)
Gannot, S., Burshtein, D., Weinstein, E.: Iterative and sequential Kalman filter-based speech enhancement algorithms. IEEE Transactions on Speech and Audio Processing 6(4), 373–385 (1998)
Girin, L., Allard, A., Schwartz, J.-L.: Speech signals separation: a new approach exploiting the coherence of audio and visual speech. In: IEEE Int. Workshop on Multimedia Signal Processing (MMSP), Cannes, France (2001)
Goecke, R., Potamianos, G., Neti, C.: Noisy audio feature enhancement using audio-visual speech data. In: Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP), Orlando, USA, May 2002, pp. 2025–2028 (2002)
Grant, K.W., Seitz, P.-F.: The use of visible speech cues for improving auditory detection of spoken sentences. Journal of the Acoustical Society of America 108, 1197–1208 (2000)
Hérault, J., Jutten, C.: Space or time adaptive signal processing by neural network models. In: Intern. Conf. on Neural Networks for Computing, Snowbird, USA, pp. 206–211 (1986)
Jutten, C., Hérault, J.: Blind separation of sources. Part I: An adaptive algorithm based on a neuromimetic architecture. Signal Processing 24(1), 1–10 (1991)
Jutten, C., Taleb, A.: Source separation: from dusk till dawn. In: Proc. Int. Conf. Independent Component Analysis and Blind Source Separation (ICA), Helsinki, Finland, June 2000, pp. 15–26 (2000)
Kim, J., Davis, C.: Investigating the audio–visual speech detection advantage. Speech Communication 44(1-4), 19–30 (2004)
McGurk, H., MacDonald, J.: Hearing lips and seeing voices. Nature 264, 746–748 (1976)
Milner, B., Almajai, I.: Noisy audio speech enhancement using Wiener filters derived from visual speech. In: Proc. Int. Conf. Auditory-Visual Speech Processing (AVSP), Moreton Island, Australia (September 2007)
Naqvi, S.M., Zhang, Y., Tsalaile, T., Sanei, S., Chambers, J.A.: A multimodal approach for frequency domain independent component analysis with geometrically-based initialization. In: Proc. EUSIPCO, Lausanne, Switzerland (2008)
Potamianos, G., Neti, C., Deligne, S.: Joint Audio-Visual Speech Processing for Recognition and Enhancement. In: Proc. Int. Conf. Auditory-Visual Speech Processing (AVSP), St. Jorioz, France (September 2003)
Rivet, B., Girin, L., Jutten, C.: Solving the indeterminations of blind source separation of convolutive speech mixtures. In: Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP), Philadelphia, USA, March 2005, pp. V-533–V-536 (2005)
Rivet, B., Girin, L., Jutten, C.: Log-Rayleigh distribution: a simple and efficient statistical representation of log-spectral coefficients. IEEE Transactions on Audio, Speech and Language Processing 15(3), 796–802 (2007)
Rivet, B., Girin, L., Jutten, C.: Mixing audiovisual speech processing and blind source separation for the extraction of speech signals from convolutive mixtures. IEEE Transactions on Audio, Speech and Language Processing 15(1), 96–108 (2007)
Rivet, B., Girin, L., Jutten, C.: Visual voice activity detection as a help for speech source separation from convolutive mixtures. Speech Communication 49(7-8), 667–677 (2007)
Rivet, B., Girin, L., Servière, C., Pham, D.-T., Jutten, C.: Using a visual voice activity detector to regularize the permutations in blind source separation of convolutive speech mixtures. In: Proc. Int. Conf. on Digital Signal Processing (DSP), Cardiff, Wales UK, July 2007, pp. 223–226 (2007)
Sanei, S., Naqvi, S.M., Chambers, J.A., Hicks, Y.: A geometrically constrained multimodal approach for convolutive blind source separation. In: Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP), Honolulu, Hawaii, USA, April 2007, pp. 969–972 (2007)
Sodoyer, D., Girin, L., Jutten, C., Schwartz, J.-L.: Developing an audio-visual speech source separation algorithm. Speech Communication 44(1-4), 113–125 (2004)
Sodoyer, D., Girin, L., Savariaux, C., Schwartz, J.-L., Rivet, B., Jutten, C.: A study of lip movements during spontaneous dialog and its application to voice activity detection. Journal of the Acoustical Society of America 125(2), 1184–1196 (2009)
Sodoyer, D., Rivet, B., Girin, L., Schwartz, J.-L., Jutten, C.: An analysis of visual speech information applied to voice activity detection. In: Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP), Toulouse, France, pp. 601–604 (2006)
Sodoyer, D., Schwartz, J.-L., Girin, L., Klinkisch, J., Jutten, C.: Separation of audio-visual speech sources: a new approach exploiting the audiovisual coherence of speech stimuli. EURASIP Journal on Applied Signal Processing 2002(11), 1165–1173 (2002)
Stork, D.G., Hennecke, M.E.: Speechreading by Humans and Machines. Springer, Berlin (1996)
Sumby, W., Pollack, I.: Visual contribution to speech intelligibility in noise. Journal of Acoustical Society of America 26, 212–215 (1954)
Summerfield, Q.: Some preliminaries to a comprehensive account of audio-visual speech perception. In: Dodd, B., Campbell, R. (eds.) Hearing by Eye: The Psychology of Lipreading, pp. 3–51. Lawrence Erlbaum Associates, Mahwah (1987)
Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. In: Proc. IEEE Conf. Comput. Vision Pattern Recognition (CVPR), Kauai, Hawaii, USA, December 2001, pp. 511–518 (2001)
Wang, W., Cosker, D., Hicks, Y., Sanei, S., Chambers, J.A.: Video assisted speech source separation. In: Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP), Philadelphia, USA (March 2005)
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
Cite this paper
Rivet, B., Chambers, J. (2010). Multimodal Speech Separation. In: Solé-Casals, J., Zaiats, V. (eds.) Advances in Nonlinear Speech Processing. NOLISP 2009. Lecture Notes in Computer Science, vol. 5933. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-11509-7_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-11508-0
Online ISBN: 978-3-642-11509-7