Skip to main content

Robust Multi-stream Keyword and Non-linguistic Vocalization Detection for Computationally Intelligent Virtual Agents

  • Conference paper
Advances in Neural Networks – ISNN 2011 (ISNN 2011)

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 6676))

Included in the following conference series:

Abstract

Systems for keyword and non-linguistic vocalization detection in conversational agent applications need to be robust with respect to background noise and different speaking styles. Focussing on the Sensitive Artificial Listener (SAL) scenario which involves spontaneous, emotionally colored speech, this paper proposes a multi-stream model that applies the principle of Long Short-Term Memory to generate context-sensitive phoneme predictions which can be used for keyword detection. Further, we investigate the incorporation of noisy training material in order to create noise robust acoustic models. We show that both strategies can improve recognition performance when evaluated on spontaneous human-machine conversations as contained in the SEMAINE database.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 84.99
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 109.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. McTear, M.F.: Spoken dialogue technology: enabling the conversational user interface. ACM Computing Surverys 34(1), 90–169 (2002)

    Article  Google Scholar 

  2. Droppo, J., Acero, A.: Environmental robustness. In: Handbook of Speech Processing, pp. 658–659. Springer, Heidelberg (2007)

    Google Scholar 

  3. Schuller, B., Wöllmer, M., Moosmayr, T., Rigoll, G.: Recognition of noisy speech: A comparative survey of robust model architecture and feature enhancement. Journal on Audio, Speech, and Music Processing (2009), ID 942617

    Google Scholar 

  4. Zhu, Q., Chen, B., Morgan, N., Stolcke, A.: Tandem connectionist feature extraction for conversational speech recognition. In: Bengio, S., Bourlard, H. (eds.) MLMI 2004. LNCS, vol. 3361, pp. 223–231. Springer, Heidelberg (2005)

    Chapter  Google Scholar 

  5. Hochreiter, S., Bengio, Y., Frasconi, P., Schmidhuber, J.: Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In: Kremer, S.C., Kolen, J.F. (eds.) A Field Guide to Dynamical Recurrent Neural Networks, pp. 1–15. IEEE Press, Los Alamitos (2001)

    Google Scholar 

  6. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997)

    Article  Google Scholar 

  7. Wöllmer, M., Eyben, F., Graves, A., Schuller, B., Rigoll, G.: Bidirectional LSTM networks for context-sensitive keyword detection in a cognitive virtual agent framework. Cognitive Computation 2(3), 180–190 (2010)

    Article  Google Scholar 

  8. Wöllmer, M., Eyben, F., Schuller, B., Rigoll, G.: Recognition of spontaneous conversational speech using long short-term memory phoneme predictions. In: Proc. of Interspeech, Makuhari, Japan, pp. 1946–1949 (2010)

    Google Scholar 

  9. Schröder, M., Cowie, R., Heylen, D., Pantic, M., Pelachaud, C., Schuller, B.: Towards responsive sensitive artificial listeners. In: Proc. of 4th Intern. Workshop on Human-Computer Conversation, Bellagio, Italy, pp. 1–6 (2008)

    Google Scholar 

  10. Wöllmer, M., Schuller, B., Eyben, F., Rigoll, G.: Combining long short-term memory and dynamic bayesian networks for incremental emotion-sensitive artificial listening. IEEE Journal of Selected Topics in Signal Processing 4(5), 867–881 (2010)

    Article  Google Scholar 

  11. Stupakov, A., Hanusa, E., Bilmes, J., Fox, D.: COSINE - a corpus of multi-party conversational speech in noisy environments. In: Proc. of ICASSP, Taipei, Taiwan (2009)

    Google Scholar 

  12. Eyben, F., Wöllmer, M., Schuller, B.: openSMILE - the Munich versatile and fast open-source audio feature extractor. In: Proc. of ACM Multimedia, Firenze, Italy, pp. 1459–1462 (2010)

    Google Scholar 

  13. Principi, E., Cifani, S., Rocchi, C., Squartini, S., Piazza, F.: Keyword spotting based system for conversation fostering in tabletop scenarios: preliminary evaluation. In: Proc. of HSI, Catania, Italy, pp. 216–219 (2009)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Wöllmer, M., Marchi, E., Squartini, S., Schuller, B. (2011). Robust Multi-stream Keyword and Non-linguistic Vocalization Detection for Computationally Intelligent Virtual Agents. In: Liu, D., Zhang, H., Polycarpou, M., Alippi, C., He, H. (eds) Advances in Neural Networks – ISNN 2011. ISNN 2011. Lecture Notes in Computer Science, vol 6676. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-21090-7_58

Download citation

  • DOI: https://doi.org/10.1007/978-3-642-21090-7_58

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-21089-1

  • Online ISBN: 978-3-642-21090-7

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics