Abstract
This chapter presents a literature review that places the research proposed in this book in context, building on the background presented in the previous chapters. First, the overall speech processing domain is briefly discussed. The review presents examples of listening devices using directional microphones, microphone arrays, noise reduction algorithms, and rule-based automatic decision making, demonstrating that the multimodal two-stage framework presented later in this book has established precedent in the context of real-world hearing aid devices. The other aspect vital to the research context of this work is the field of audiovisual speech filtering. This chapter reviews multimodal speech enhancement, discussing the early audiovisual speech filtering systems in the literature and the subsequent development and diversification of the field. A number of state-of-the-art speech filtering systems are examined and reviewed in depth, particularly multimodal beamforming and Wiener filtering. Finally, several audiovisual speech databases are evaluated.
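The Wiener filtering reviewed in this chapter rests on a simple spectral-domain idea: attenuate each frequency bin in proportion to its estimated speech-to-noise power ratio. The sketch below is a minimal, generic illustration of that classical gain rule, not the visually-derived variant of Almajai and Milner reviewed later; the function name, the toy spectra, and the gain floor are illustrative assumptions.

```python
import numpy as np

def wiener_gain(speech_psd, noise_psd, floor=1e-3):
    """Classical Wiener gain G = S / (S + N), applied per frequency bin.

    speech_psd and noise_psd are estimated power spectral densities.
    The `floor` limits attenuation, a common guard against
    musical-noise artifacts (illustrative choice, not from the text).
    """
    gain = speech_psd / (speech_psd + noise_psd)
    return np.maximum(gain, floor)

# Toy example: one noisy spectral frame of 256 bins.
rng = np.random.default_rng(0)
clean_psd = np.abs(rng.normal(size=256)) ** 2 + 1.0  # assumed known here
noise_psd = np.full(256, 0.5)                        # flat noise estimate
noisy_magnitude = np.sqrt(clean_psd + noise_psd)

# Enhancement multiplies the noisy magnitude spectrum by the gain.
enhanced = wiener_gain(clean_psd, noise_psd) * noisy_magnitude
```

In the audiovisual systems this chapter reviews, the key difference is where `speech_psd` comes from: it is estimated from visual speech features (e.g. lip shape) rather than assumed known, while the gain rule itself stays the same.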
References
K. Chung, Challenges and recent developments in hearing aids. Part i. Speech understanding in noise, microphone technologies and noise reduction algorithms. Trends Amplif. 8(3), 83–124 (2004)
T. Ricketts, H. Mueller, Making sense of directional microphone hearing aids. Am. J. Audiol. 8(2), 117 (1999)
M. Valente, Use of microphone technology to improve user performance in noise, in Textbook of Hearing Aid Amplification (Singular/Thomson Learning, San Diego, 2000), p. 247
F. Kuk, D. Keenan, C. Lau, C. Ludvigsen, Performance of a fully adaptive directional microphone to signals presented from various azimuths. J. Am. Acad. Audiol. 16(6), 333–347 (2005)
M. Cord, R. Surr, B. Walden, L. Olson, Performance of directional microphone hearing aids in everyday life. J. Am. Acad. Audiol. 13(6), 295–307 (2002)
M. Cord, R. Surr, B. Walden, O. Dyrlund, Relationship between laboratory measures of directional advantage and everyday success with directional microphone hearing aids. J. Am. Acad. Audiol. 15(5), 353–364 (2004)
T. Ricketts, P. Henry, Evaluation of an adaptive, directional-microphone hearing aid. Int. J. Audiol. 41(2), 100–112 (2002)
R. Bentler, C. Palmer, A. Dittberner, Hearing-in-noise: comparison of listeners with normal and (aided) impaired hearing. J. Am. Acad. Audiol. 15(3), 216–225 (2004)
L. Mens, Speech understanding in noise with an eyeglass hearing aid: asymmetric fitting and the head shadow benefit of anterior microphones. Int. J. Audiol. 50(1), 27–33 (2011)
L. Christensen, D. Helmink, W. Soede, M. Killion, Complaints about hearing in noise: a new answer. Hear. Rev. 9(6), 34–36 (2002)
S. Laugesen, T. Schmidtke, Improving on the speech-in-noise problem with wireless array technology. News from Oticon (2004), pp. 3–23
S. Rosen, Temporal information in speech: acoustic, auditory and linguistic aspects. Philos. Trans.: Biol. Sci. 336, 367–373 (1992)
N. Tellier, H. Arndt, H. Luo, Speech or noise? Using signal detection and noise reduction. Hear. Rev. 10(6), 48–51 (2003)
H. Levitt, Noise reduction in hearing aids: an overview. J. Rehabil. Res. Dev. 38(1), 111–121 (2001)
M. Boymans, W. Dreschler, P. Schoneveld, H. Verschuure, Clinical evaluation of a full-digital in-the-ear hearing instrument. Int. J. Audiol. 38(2), 99–108 (1999)
J. Alcántara, B. Moore, V. Kühnel, S. Launer, Evaluation of the noise reduction system in a commercial digital hearing aid. Int. J. Audiol. 42(1), 34–42 (2003)
C. Elberling, About the voicefinder. News from Oticon (2002)
D. Schum, Noise-reduction circuitry in hearing aids: (2) goals and current strategies. Hear. J. 56(6), 32 (2003)
L. Girin, J. Schwartz, G. Feng, Audio-visual enhancement of speech in noise. J. Acoust. Soc. Am. 109, 3007 (2001)
R. Goecke, G. Potamianos, C. Neti, Noisy audio feature enhancement using audio-visual speech data, in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP’02), vol. 2 (IEEE, 2002), pp. 2025–2028
S. Deligne, G. Potamianos, C. Neti, Audio-visual speech enhancement with AVCDCN (audio-visual codebook dependent cepstral normalization), in Proceedings of the Sensor Array and Multichannel Signal Processing Workshop (IEEE, 2003), pp. 68–71
A. Acero, R. Stern, Environmental robustness in automatic speech recognition, in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, ICASSP-90 (IEEE, 1990), pp. 849–852
L. Deng, A. Acero, L. Jiang, J. Droppo, X. Huang, High-performance robust speech recognition using stereo training data, in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP’01), vol. 1 (IEEE, 2002), pp. 301–304
B. Rivet, J. Chambers, Multimodal speech separation, in Advances in Nonlinear Speech Processing, vol. 5933, Lecture Notes in Computer Science, ed. by J. Sole-Casals, V. Zaiats (Springer, Berlin, 2010), pp. 1–11
B. Rivet, L. Girin, C. Jutten, Log-Rayleigh distribution: a simple and efficient statistical representation of log-spectral coefficients. IEEE Trans. Audio Speech Lang. Process. 15(3), 796–802 (2007)
B. Rivet, L. Girin, C. Jutten, Mixing audiovisual speech processing and blind source separation for the extraction of speech signals from convolutive mixtures. IEEE Trans. Audio Speech Lang. Process. 15(1), 96–108 (2007)
B. Rivet, L. Girin, C. Serviere, D.-T. Pham, C. Jutten, Using a visual voice activity detector to regularize the permutations in blind separation of convolutive speech mixtures, in Proceedings of the 15th International Conference on Digital Signal Processing (2007), pp. 223–226
B. Rivet, L. Girin, C. Jutten, Visual voice activity detection as a help for speech source separation from convolutive mixtures. Speech Commun. 49(7–8), 667–677 (2007)
C. Jutten, J. Herault, Blind separation of sources, part I: an adaptive algorithm based on neuromimetic architecture. Signal Process. 24(1), 1–10 (1991)
J. Herault, C. Jutten, B. Ans, Détection de grandeurs primitives dans un message composite par une architecture de calcul neuromimétique en apprentissage non supervisé. Actes du Xème colloque GRETSI 2, 1017–1020 (1985)
E. Cherry, Some experiments on the recognition of speech, with one and with two ears. J. Acoust. Soc. Am. 25(5), 975–979 (1953)
L. Girin, G. Feng, J. Schwartz, Fusion of auditory and visual information for noisy speech enhancement: a preliminary study of vowel transitions, in Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 2 (IEEE, 1998), pp. 1005–1008
D. Sodoyer, L. Girin, C. Jutten, J. Schwartz, Developing an audio-visual speech source separation algorithm. Speech Commun. 44(1–4), 113–125 (2004)
D. Sodoyer, J. Schwartz, L. Girin, J. Klinkisch, C. Jutten, Separation of audio-visual speech sources: a new approach exploiting the audio-visual coherence of speech stimuli. EURASIP J. Appl. Signal Process. 2002(1), 1165–1173 (2002)
S. Naqvi, M. Yu, J. Chambers, A multimodal approach to blind source separation of moving sources. IEEE J. Sel. Top. Signal Process. 4(5), 895–910 (2010)
P. Viola, M. Jones, Rapid object detection using a boosted cascade of simple features, in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1 (IEEE Computer Society, 2001), pp. 511–518
A. Hyvarinen, J. Karhunen, E. Oja, Independent Component Analysis, vol. 26 (Wiley-Interscience, New York, 2001)
E. Bingham, A. Hyvarinen, A fast fixed-point algorithm for independent component analysis of complex valued signals. Int. J. Neural Syst. 10(1), 1–8 (2000)
J. Barker, X. Shao, Audio-visual speech fragment decoding, in Proceedings of the International Conference on Auditory-Visual Speech Processing (2007), pp. 37–42
J. Barker, M. Cooke, D. Ellis, Decoding speech in the presence of other sources. Speech Commun. 45(1), 5–25 (2005)
A. Bregman, Auditory Scene Analysis: The Perceptual Organization of Sound (The MIT Press, Cambridge, 1990)
A. Bregman, Auditory Scene Analysis: Hearing in Complex Environments (Oxford University Press, Oxford, 1993)
J. Barker, X. Shao, Energetic and informational masking effects in an audiovisual speech recognition system. IEEE Trans. Audio Speech Lang. Process. 17(3), 446–458 (2009)
M. Cooke, J. Barker, S. Cunningham, X. Shao, An audio-visual corpus for speech perception and automatic speech recognition. J. Acoust. Soc. Am. 120(5 Pt 1), 2421–2424 (2006)
I. Almajai, B. Milner, Enhancing audio speech using visual speech features, in Proceedings of Interspeech (Brighton, 2009)
N. Wiener, Extrapolation, Interpolation, and Smoothing of Stationary Time Series: With Engineering Applications (The MIT Press, Cambridge, 1949)
I. Almajai, B. Milner, Maximising audio-visual speech correlation, in Proceedings of the AVSP (2007)
I. Almajai, B. Milner, J. Darch, S. Vaseghi, Visually-derived Wiener filters for speech enhancement, in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, vol. 4 (2007), pp. 585–588
I. Almajai, B. Milner, Effective visually-derived Wiener filtering for audio-visual speech processing, in Proceedings of Interspeech (Brighton, UK, 2009)
B. Milner, I. Almajai, Noisy audio speech enhancement using Wiener filters derived from visual speech, in Proceedings of the International Workshop on Auditory-Visual Speech Processing (AVSP)
B. Lee, M. Hasegawa-Johnson, C. Goudeseune, S. Kamdar, S. Borys, M. Liu, T. Huang, AVICAR: audio-visual speech corpus in a car environment, in Proceedings of the Conference on Spoken Language, Jeju, Korea (Citeseer, 2004), pp. 2489–2492
H. Lane, B. Tranel, The Lombard sign and the role of hearing in speech. J. Speech Hear. Res. 14(4), 677 (1971)
T. Wakasugi, M. Nishiura, K. Fukui, Robust lip contour extraction using separability of multi-dimensional distributions, in Proceedings of the Sixth IEEE International Conference on Automatic Face and Gesture Recognition (IEEE, 2004), pp. 415–420
A. Liew, S. Leung, W. Lau, Lip contour extraction from color images using a deformable model. Pattern Recognit. 35(12), 2949–2962 (2002)
Q. Nguyen, M. Milgram, Semi adaptive appearance models for lip tracking, in Proceedings of the ICIP09 (2009), pp. 2437–2440
M. Kass, A. Witkin, D. Terzopoulos, Snakes: active contour models. Int. J. Comput. Vis. 1, 321–331 (1988)
A. Das, D. Ghoshal, Extraction of time invariant lips based on morphological operation and corner detection method. Int. J. Comput. Appl. 48(21), 7–11 (2012)
Y. Cheung, X. Liu, X. You, A local region based approach to lip tracking. Pattern Recognit. 45, 3336–3347 (2012)
X. Zhang, R. Mersereau, Lip feature extraction towards an automatic speechreading system, in Proceedings of the 2000 International Conference on Image Processing, vol. 3 (IEEE, 2000), pp. 226–229
N. Eveno, A. Caplier, P. Coulon, New color transformation for lips segmentation, in IEEE Fourth Workshop on Multimedia Signal Processing (IEEE, 2001), pp. 3–8
N. Eveno, A. Caplier, P. Coulon, Key points based segmentation of lips, in Proceedings of the 2002 IEEE International Conference on Multimedia and Expo, ICME’02, vol. 2, (IEEE, 2002), pp. 125–128
D. Freedman, M. Brandstein, Contour tracking in clutter: a subset approach. Int. J. Comput. Vis. 38(2), 173–186 (2000)
Z. Ji, Y. Su, J. Wang, R. Hua, Robust sea-sky-line detection based on horizontal projection and hough transformation, in 2nd International Congress on Image and Signal Processing, CISP’09 (IEEE, 2009), pp. 1–4
C. Harris, M. Stephens, A combined corner and edge detector, in Alvey Vision Conference, vol. 15 (Manchester, 1988), p. 50
J. Luettin, N. Thacker, S. Beet, Visual speech recognition using active shape models and hidden Markov models, in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP-96, vol. 2 (IEEE, 1996), pp. 817–820
Q. Nguyen, M. Milgram, T. Nguyen, Multi features models for robust lip tracking, in 10th International Conference on Control, Automation, Robotics and Vision, 2008. ICARCV 2008, (IEEE, 2008), pp. 1333–1337
T. Cootes, G. Edwards, C. Taylor, Active appearance models, in Computer Vision, ECCV'98 (1998), pp. 484–498
A. Yuille, P. Hallinan, D. Cohen, Feature extraction from faces using deformable templates. Int. J. Comput. Vis. 8(2), 99–111 (1992)
G. Chiou, J. Hwang, Lipreading from color video. IEEE Trans. Image Process. 6(8), 1192–1195 (1997)
M. Yang, D. Kriegman, N. Ahuja, Detecting faces in images: a survey. IEEE Trans. Pattern Anal. Mach. Intell. 24(1), 34–58 (2002)
S. Wang, A. Abdel-Dayem, Improved viola-jones face detector, in Proceedings of the 1st Taibah University International Conference on Computing and Information Technology, ICCIT’12 (2012), pp. 321–328
C. Kotropoulos, I. Pitas, Rule-based face detection in frontal views, in IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP-97, vol. 4, (IEEE, 1997), pp. 2537–2540
G. Yang, T. Huang, Human face detection in a complex background. Pattern Recognit. 27(1), 53–63 (1994)
R. Kjeldsen, J. Kender, Finding skin in color images, in Proceedings of the Second International Conference on Automatic Face and Gesture Recognition (IEEE, 1996), pp. 312–317
K. Yow, R. Cipolla, A probabilistic framework for perceptual grouping of features for human face detection, in Proceedings of the Second International Conference on Automatic Face and Gesture Recognition, (IEEE, 1996), pp. 16–21
T. Kohonen, Self-Organization and Associative Memory (Springer, Berlin, 1989)
K. Sung, Learning and example selection for object and pattern detection (1996)
T. Agui, Y. Kokubo, H. Nagahashi, T. Nagao, Extraction of face regions from monochromatic photographs using neural networks, in Proceedings of the International Conference on Robotics (1992)
F. Crow, Summed-area tables for texture mapping. Comput. Graph. 18(3), 207–212 (1984)
G. Bradski, The OpenCV Library. Dr. Dobb’s J. Softw. Tools 25(11), 120–126 (2000)
C. Zhang, Z. Zhang, A survey of recent advances in face detection. Microsoft Research, June 2010
R. Meir, G. Rätsch, An introduction to boosting and leveraging, Advanced Lectures on Machine Learning (Springer, New York, 2003), pp. 118–183
Y. Freund, R. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, Computational Learning Theory (Springer, Berlin, 1995), pp. 23–37
J. Friedman, T. Hastie, R. Tibshirani, Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). The Ann. Stat. 28(2), 337–407 (2000)
S. Brubaker, J. Wu, J. Sun, M. Mullin, J. Rehg, On the design of cascades of boosted ensembles for face detection. Int. J. Comput. Vis. 77(1), 65–86 (2008)
S. Li, L. Zhu, Z. Zhang, A. Blake, H. Zhang, H. Shum, Statistical learning of multi-view face detection, in Computer Vision, ECCV 2002 (2006), pp. 117–121
C. Bishop, P. Viola, Learning and vision: discriminative methods. ICCV Course Lear. Vis. 2(7), 11 (2003)
R. Schapire, Y. Singer, Improved boosting algorithms using confidence-rated predictions. Mach. Learn. 37(3), 297–336 (1999)
X. Huang, S. Li, Y. Wang, Jensen-Shannon boosting learning for object recognition, in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR, vol. 2 (IEEE, 2005), pp. 144–149
E. Patterson, S. Gurbuz, Z. Tufekci, J. Gowdy, CUAVE: a new audio-visual database for multimodal human-computer interface research, in IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP'02, vol. 2 (IEEE, 2002), p. II
E. Bailly-Bailliere, S. Bengio, F. Bimbot, M. Hamouz, J. Kittler, J. Mariéthoz, J. Matas, K. Messer, V. Popovici, F. Porée et al., The BANCA database and evaluation protocol, Audio- and Video-Based Biometric Person Authentication (Springer, 2003), p. 1057
K. Messer, J. Matas, J. Kittler, J. Luettin, G. Maitre, XM2VTSDB: the extended M2VTS database, in Second International Conference on Audio and Video-based Biometric Person Authentication, vol. 964 (Citeseer, 1999), pp. 965–966
C. Sanderson, K. Paliwal, Polynomial features for robust face authentication, in Proceedings of the International Conference on Image Processing, vol. 3 (IEEE, 2002), pp. 997–1000
C. Sanderson, Biometric Person Recognition: Face, Speech and Fusion (VDM Verlag Dr. Müller, 2008)
Copyright information
© 2015 The Author(s)
About this chapter
Cite this chapter
Abel, A., Hussain, A. (2015). The Research Context. In: Cognitively Inspired Audiovisual Speech Filtering. SpringerBriefs in Cognitive Computation, vol 5. Springer, Cham. https://doi.org/10.1007/978-3-319-13509-0_3
DOI: https://doi.org/10.1007/978-3-319-13509-0_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-13508-3
Online ISBN: 978-3-319-13509-0
eBook Packages: Biomedical and Life Sciences (R0)