Abstract
The work of Bernstein and of Benoît has confirmed that exploiting multiple senses, in particular both the audio and visual modalities, is advantageous in speech perception. Consequently, looking at the speaker’s face helps a listener to hear a speech signal in a noisy environment and to extract it from competing sources, as originally identified by Cherry, who posed the so-called “Cocktail Party” problem. To exploit this intrinsic coherence between audition and vision within a machine, the method of blind source separation (BSS) is particularly attractive.
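The BSS idea mentioned above can be illustrated with a minimal toy sketch. This is not the chapter's method (which addresses convolutive audio-visual mixtures); it is a from-scratch FastICA-style iteration, using only NumPy, that unmixes an instantaneous two-source mixture. All signal and variable names here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two statistically independent, non-Gaussian "sources" (stand-ins for speakers).
n = 5000
t = np.linspace(0.0, 1.0, n)
s1 = np.sign(np.sin(2 * np.pi * 13 * t))   # sub-Gaussian square wave
s2 = rng.laplace(size=n)                   # super-Gaussian noise
S = np.vstack([s1, s2])

# Instantaneous (non-convolutive) mixing: x = A s, with unknown A.
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])
X = A @ S

# Whitening: center the mixtures and decorrelate them to unit variance.
Xc = X - X.mean(axis=1, keepdims=True)
cov = Xc @ Xc.T / n
d, E = np.linalg.eigh(cov)
Z = (E @ np.diag(d ** -0.5) @ E.T) @ Xc

# Symmetric FastICA with a tanh nonlinearity.
W = rng.standard_normal((2, 2))
for _ in range(200):
    G = np.tanh(W @ Z)
    Gp = 1.0 - G ** 2                      # derivative of tanh
    W_new = (G @ Z.T) / n - np.diag(Gp.mean(axis=1)) @ W
    # Symmetric orthogonalization: W <- (W W^T)^{-1/2} W via SVD.
    u, _, vt = np.linalg.svd(W_new)
    W = u @ vt

Y = W @ Z  # estimated sources, recovered up to permutation and scale
```

The permutation and scale ambiguities visible in the last line are exactly the indeterminations that the audio-visual coherence exploited in this chapter helps to resolve for the harder convolutive case.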
References
Aubrey, A., Rivet, B., Hicks, Y., Girin, L., Chambers, J., Jutten, C.: Two novel visual voice activity detectors based on appearance models and retinal filtering. In: Proc. European Signal Processing Conference (EUSIPCO), Poznan, Poland, September 2007, pp. 2409–2413 (2007)
Benoît, C., Mohamadi, T., Kandel, S.: Effects of phonetic context on audio-visual intelligibility of French. J. Speech and Hearing Research 37, 1195–1293 (1994)
Bernstein, L.E., Auer, E.T.J., Takayanagi, S.: Auditory speech detection in noise enhanced by lipreading. Speech Communication 44(1-4), 5–18 (2004)
Cherry, E.C.: Some experiments on the recognition of speech, with one and with two ears. Journal of the Acoustical Society of America 25(5), 975–979 (1953)
Comon, P.: Independent component analysis, a new concept? Signal Processing 36(3), 287–314 (1994)
Deligne, S., Potamianos, G., Neti, C.: Audio-Visual speech enhancement with AVCDCN (AudioVisual Codebook Dependent Cepstral Normalization). In: Proc. Int. Conf. Spoken Language Processing (ICSLP), Denver, Colorado, USA, September 2002, pp. 1449–1452 (2002)
Ephraim, Y., Malah, D.: Speech Enhancement Using a Minimum Mean-Square Error Short-Time Spectral Amplitude Estimator. IEEE Transactions on Acoustics, Speech, and Signal Processing 32(6), 1109–1121 (1984)
Erber, N.P.: Interaction of audition and vision in the recognition of oral speech stimuli. J. Speech and Hearing Research 12, 423–425 (1969)
Gannot, S., Burshtein, D., Weinstein, E.: Iterative and sequential Kalman filter-based speech enhancement algorithms. IEEE Transactions on Speech and Audio Processing 6(4), 373–385 (1998)
Girin, L., Allard, A., Schwartz, J.-L.: Speech signals separation: a new approach exploiting the coherence of audio and visual speech. In: IEEE Int. Workshop on Multimedia Signal Processing (MMSP), Cannes, France (2001)
Goecke, R., Potamianos, G., Neti, C.: Noisy audio feature enhancement using audio-visual speech data. In: Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP), Orlando, USA, May 2002, pp. 2025–2028 (2002)
Grant, K.W., Seitz, P.-F.: The use of visible speech cues for improving auditory detection of spoken sentences. Journal of the Acoustical Society of America 108, 1197–1208 (2000)
Hérault, J., Jutten, C.: Space or time adaptive signal processing by neural network models. In: Intern. Conf. on Neural Networks for Computing, Snowbird, USA, pp. 206–211 (1986)
Jutten, C., Hérault, J.: Blind separation of sources. Part I: An adaptive algorithm based on a neuromimetic architecture. Signal Processing 24(1), 1–10 (1991)
Jutten, C., Taleb, A.: Source separation: from dusk till dawn. In: Proc. Int. Conf. Independent Component Analysis and Blind Source Separation (ICA), Helsinki, Finland, June 2000, pp. 15–26 (2000)
Kim, J., Davis, C.: Investigating the audio–visual speech detection advantage. Speech Communication 44(1-4), 19–30 (2004)
McGurk, H., MacDonald, J.: Hearing lips and seeing voices. Nature 264, 746–748 (1976)
Milner, B., Almajai, I.: Noisy audio speech enhancement using Wiener filters derived from visual speech. In: Proc. Int. Conf. Auditory-Visual Speech Processing (AVSP), Moreton Island, Australia (September 2007)
Naqvi, S.M., Zhang, Y., Tsalaile, T., Sanei, S., Chambers, J.A.: A multimodal approach for frequency domain independent component analysis with geometrically-based initialization. In: Proc. EUSIPCO, Lausanne, Switzerland (2008)
Potamianos, G., Neti, C., Deligne, S.: Joint Audio-Visual Speech Processing for Recognition and Enhancement. In: Proc. Int. Conf. Auditory-Visual Speech Processing (AVSP), St. Jorioz, France (September 2003)
Rivet, B., Girin, L., Jutten, C.: Solving the indeterminations of blind source separation of convolutive speech mixtures. In: Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP), Philadelphia, USA, March 2005, pp. V-533–V-536 (2005)
Rivet, B., Girin, L., Jutten, C.: Log-Rayleigh distribution: a simple and efficient statistical representation of log-spectral coefficients. IEEE Transactions on Audio, Speech and Language Processing 15(3), 796–802 (2007)
Rivet, B., Girin, L., Jutten, C.: Mixing audiovisual speech processing and blind source separation for the extraction of speech signals from convolutive mixtures. IEEE Transactions on Audio, Speech and Language Processing 15(1), 96–108 (2007)
Rivet, B., Girin, L., Jutten, C.: Visual voice activity detection as a help for speech source separation from convolutive mixtures. Speech Communication 49(7-8), 667–677 (2007)
Rivet, B., Girin, L., Servière, C., Pham, D.-T., Jutten, C.: Using a visual voice activity detector to regularize the permutations in blind source separation of convolutive speech mixtures. In: Proc. Int. Conf. on Digital Signal Processing (DSP), Cardiff, Wales UK, July 2007, pp. 223–226 (2007)
Sanei, S., Naqvi, S.M., Chambers, J.A., Hicks, Y.: A geometrically constrained multimodal approach for convolutive blind source separation. In: Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP), Honolulu, Hawaii, USA, April 2007, pp. 969–972 (2007)
Sodoyer, D., Girin, L., Jutten, C., Schwartz, J.-L.: Developing an audio-visual speech source separation algorithm. Speech Communication 44(1-4), 113–125 (2004)
Sodoyer, D., Girin, L., Savariaux, C., Schwartz, J.-L., Rivet, B., Jutten, C.: A study of lip movements during spontaneous dialog and its application to voice activity detection. Journal of the Acoustical Society of America 125(2), 1184–1196 (2009)
Sodoyer, D., Rivet, B., Girin, L., Schwartz, J.-L., Jutten, C.: An analysis of visual speech information applied to voice activity detection. In: Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP), Toulouse, France, pp. 601–604 (2006)
Sodoyer, D., Schwartz, J.-L., Girin, L., Klinkisch, J., Jutten, C.: Separation of audio-visual speech sources: a new approach exploiting the audiovisual coherence of speech stimuli. EURASIP Journal on Applied Signal Processing 2002(11), 1165–1173 (2002)
Stork, D.G., Hennecke, M.E.: Speechreading by Humans and Machines. Springer, Berlin (1996)
Sumby, W., Pollack, I.: Visual contribution to speech intelligibility in noise. Journal of Acoustical Society of America 26, 212–215 (1954)
Summerfield, Q.: Some preliminaries to a comprehensive account of audio-visual speech perception. In: Dodd, B., Campbell, R. (eds.) Hearing by Eye: The Psychology of Lipreading, pp. 3–51. Lawrence Erlbaum Associates, Mahwah (1987)
Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. In: Proc. IEEE Conf. Comput. Vision Pattern Recognition (CVPR), Kauai, Hawaii, USA, December 2001, pp. 511–518 (2001)
Wang, W., Cosker, D., Hicks, Y., Sanei, S., Chambers, J.A.: Video assisted speech source separation. In: Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP), Philadelphia, USA (March 2005)
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
Cite this paper
Rivet, B., Chambers, J. (2010). Multimodal Speech Separation. In: Solé-Casals, J., Zaiats, V. (eds.) Advances in Nonlinear Speech Processing. NOLISP 2009. Lecture Notes in Computer Science, vol. 5933. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-11509-7_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-11508-0
Online ISBN: 978-3-642-11509-7