An Improvement in Audio-Visual Voice Activity Detection for Automatic Speech Recognition

Yoshida, Takami; Nakadai, Kazuhiro; Okuno, Hiroshi G.

doi:10.1007/978-3-642-13022-9_6

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6096)

Included in the following conference series:

International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems

Abstract

Noise-robust Automatic Speech Recognition (ASR) is essential for robots that are expected to communicate with humans in everyday environments. In such environments, Voice Activity Detection (VAD) strongly affects ASR performance because many sources of acoustic and visual noise are present. In this paper, we improve Audio-Visual VAD for our two-layered audio-visual integration framework for ASR by using hangover processing based on erosion and dilation. We implemented the proposed method in our audio-visual speech recognition system for a robot. Empirical results show the effectiveness of the proposed method in terms of VAD.
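
The abstract does not give the details of the hangover processing, so the following is only a minimal sketch of how erosion and dilation can smooth a sequence of per-frame VAD decisions. It assumes binary frame labels (1 = speech, 0 = non-speech); the function names, the `radius` parameter, and the order of operations (closing, then opening) are illustrative assumptions, not the authors' implementation.

```python
from typing import List


def dilate(frames: List[int], radius: int) -> List[int]:
    """1-D binary dilation: a frame becomes 1 if any frame within `radius` is 1."""
    n = len(frames)
    return [
        1 if any(frames[max(0, i - radius):min(n, i + radius + 1)]) else 0
        for i in range(n)
    ]


def erode(frames: List[int], radius: int) -> List[int]:
    """1-D binary erosion: a frame stays 1 only if all frames within `radius` are 1.

    The window is truncated at the sequence boundaries, so only frames
    that actually exist are checked.
    """
    n = len(frames)
    return [
        1 if all(frames[max(0, i - radius):min(n, i + radius + 1)]) else 0
        for i in range(n)
    ]


def smooth_vad(frames: List[int], radius: int = 1) -> List[int]:
    """Smooth raw VAD decisions (hypothetical helper, assumed ordering).

    Closing (dilate, then erode) fills short gaps inside a speech segment;
    opening (erode, then dilate) removes short isolated detections.
    """
    closed = erode(dilate(frames, radius), radius)
    return dilate(erode(closed, radius), radius)


if __name__ == "__main__":
    # One-frame dropout at index 4, one-frame false alarm at index 11.
    raw = [0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0]
    print(smooth_vad(raw, radius=1))
    # prints [0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
```

In this sketch a larger `radius` bridges longer pauses but also discards longer genuine utterances, so the value would have to be tuned to the frame rate and to the shortest speech segment that should survive.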

Author information

Authors and Affiliations

Graduate School of Information Science and Engineering, Tokyo Institute of Technology, Tokyo, Japan
Takami Yoshida & Kazuhiro Nakadai
Honda Research Institute Japan, Co., Ltd., Saitama, Japan
Kazuhiro Nakadai
Graduate School of Informatics, Kyoto University, Kyoto, Japan
Hiroshi G. Okuno

Editor information

Editors and Affiliations

Dept. of Computing and Numerical Analysis, University of Cordoba, Campus Universitario de Rabanales, Einstein Building, 3rd floor, 14071, Cordoba, Spain
Nicolás García-Pedrajas
Dept. of Computer Science and Artificial Intelligence, ETS de Ingenierias Informática y de Telecomunicación, University of Granada, 18071, Granada, Spain
Francisco Herrera
School of Computing, University of the West of Scotland, PA1 2BE, Paisley, UK
Colin Fyfe
Dept. Computer Science and Artificial Intelligence, ETS de Ingenierias Informática y de Telecomunicación, University of Granada, 18071, Granada, Spain
José Manuel Benítez
Department of Computer Science, Texas State University-San Marcos, 601 University Drive, TX 78666-4616, San Marcos, USA
Moonis Ali

Cite this paper

Yoshida, T., Nakadai, K., Okuno, H.G. (2010). An Improvement in Audio-Visual Voice Activity Detection for Automatic Speech Recognition. In: García-Pedrajas, N., Herrera, F., Fyfe, C., Benítez, J.M., Ali, M. (eds) Trends in Applied Intelligent Systems. IEA/AIE 2010. Lecture Notes in Computer Science, vol 6096. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13022-9_6

DOI: https://doi.org/10.1007/978-3-642-13022-9_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-13021-2
Online ISBN: 978-3-642-13022-9
eBook Packages: Computer Science, Computer Science (R0)