Robust Multi-stream Keyword and Non-linguistic Vocalization Detection for Computationally Intelligent Virtual Agents

Wöllmer, Martin; Marchi, Erik; Squartini, Stefano; Schuller, Björn

doi:10.1007/978-3-642-21090-7_58

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 6676)

Included in the following conference series:

International Symposium on Neural Networks

Abstract

Systems for keyword and non-linguistic vocalization detection in conversational agent applications need to be robust with respect to background noise and different speaking styles. Focussing on the Sensitive Artificial Listener (SAL) scenario, which involves spontaneous, emotionally colored speech, this paper proposes a multi-stream model that applies the principle of Long Short-Term Memory to generate context-sensitive phoneme predictions that can be used for keyword detection. Further, we investigate the incorporation of noisy training material in order to create noise-robust acoustic models. We show that both strategies can improve recognition performance when evaluated on spontaneous human-machine conversations as contained in the SEMAINE database.
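
The abstract does not say how the noisy training material was produced. One common way to build such material is to add a noise recording to clean speech at a chosen signal-to-noise ratio (SNR). The sketch below illustrates that general idea only; it is not taken from the paper, and the function names `mix_at_snr` and `measured_snr_db`, the synthetic sine "speech", the Gaussian noise, and the 10 dB target are all assumptions made for illustration.

```python
import math
import random


def mix_at_snr(clean, noise, snr_db):
    """Add noise to a clean signal so the mixture has the requested SNR (dB).

    Hypothetical helper for illustration; both inputs are lists of samples.
    """
    if len(noise) < len(clean):
        raise ValueError("noise must be at least as long as the clean signal")
    noise = noise[:len(clean)]
    p_clean = sum(x * x for x in clean) / len(clean)
    p_noise = sum(x * x for x in noise) / len(noise)
    if p_clean == 0 or p_noise == 0:
        raise ValueError("signals must have non-zero power")
    # Choose gain g so that p_clean / (g^2 * p_noise) == 10^(snr_db / 10).
    gain = math.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return [c + gain * n for c, n in zip(clean, noise)]


def measured_snr_db(clean, mixture):
    """Recompute the SNR of a mixture given the clean reference signal."""
    residual = [m - c for c, m in zip(clean, mixture)]
    p_clean = sum(x * x for x in clean) / len(clean)
    p_residual = sum(x * x for x in residual) / len(residual)
    return 10 * math.log10(p_clean / p_residual)


# Stand-in signals (assumed): one second of a 440 Hz tone at 16 kHz as
# "speech", and seeded Gaussian noise as "background noise".
rng = random.Random(0)
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]
noisy = mix_at_snr(clean, noise, snr_db=10.0)
```

In a multi-condition training setup, a routine like this would typically be applied to each clean training utterance with several noise types and SNR levels, so that the acoustic model sees both clean and degraded versions of the same speech.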

Author information

Authors and Affiliations

Institute for Human-Machine Communication, Technische Universität München, 80333, München, Germany
Martin Wöllmer & Björn Schuller
3MediaLabs - A3LAB, DIBET - Dipartimento di Ingegneria Biomedica, Elettronica e Telecomunicazioni, Università Politecnica delle Marche, 60131, Ancona, Italy
Erik Marchi & Stefano Squartini

Editor information

Editors and Affiliations

Institute of Automation, Key Laboratory of Complex Systems and Intelligence Science, Chinese Academy of Sciences, 100190, Beijing, China
Derong Liu
College of Information Science and Engineering, Northeastern University, 110004, Shenyang, Liaoning, China
Huaguang Zhang
Department of Electrical and Computer Engineering, University of Cyprus, 75 Kallipoleos Avenue, 1678, Nicosia, Cyprus
Marios Polycarpou
Dipartimento di Elettronica, Politecnico di Milano, Piazza L. da Vinci 32, 20133, Milano, Italy
Cesare Alippi
Department of Electrical, Computer and Biomedical Engineering, University of Rhode Island, 02881, Kingston, RI, USA
Haibo He

About this paper

Cite this paper

Wöllmer, M., Marchi, E., Squartini, S., Schuller, B. (2011). Robust Multi-stream Keyword and Non-linguistic Vocalization Detection for Computationally Intelligent Virtual Agents. In: Liu, D., Zhang, H., Polycarpou, M., Alippi, C., He, H. (eds) Advances in Neural Networks – ISNN 2011. ISNN 2011. Lecture Notes in Computer Science, vol 6676. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-21090-7_58

DOI: https://doi.org/10.1007/978-3-642-21090-7_58
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-21089-1
Online ISBN: 978-3-642-21090-7
eBook Packages: Computer Science, Computer Science (R0)
