Label-Driven Time-Frequency Masking for Robust Speech Command Recognition

  • Conference paper
  • Text, Speech, and Dialogue (TSD 2019)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11697)

Abstract

Speech-enhancement-driven robust Automatic Speech Recognition (ASR) systems typically require a parallel corpus of noisy and clean speech utterances for training. Moreover, many studies have reported that such front-ends, though they improve speech quality, do not translate into improved recognition performance. On the other hand, multi-condition training of ASR systems offers little visualization or interpretability of how these systems achieve robustness. In this paper, we propose a novel neural architecture with unified enhancement and sequence classification blocks, trained in an end-to-end manner using only noisy speech, without any knowledge of clean speech. The enhancement block is a fully convolutional network designed to perform a Time-Frequency (T-F) masking-like operation, followed by an LSTM sequence classification block. The T-F masking formulation enables visualization of the learned mask and helps us analyse which T-F points are important for the classification of a speech command. Experiments performed on the Google Speech Commands dataset show that the proposed network achieves better results than the model without an enhancement front-end.
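The masking-like operation at the core of the enhancement block can be sketched as follows. This is a minimal NumPy illustration, not the authors' trained network: a single random 3×3 convolution filter stands in for the fully convolutional front-end, and the sigmoid bounds the predicted mask in (0, 1) so that applying it can only attenuate the noisy spectrogram, never amplify it.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv2d_same(x, kernel):
    """Naive 'same'-padded 2-D correlation over a (freq, time) map."""
    kf, kt = kernel.shape
    pf, pt = kf // 2, kt // 2
    padded = np.pad(x, ((pf, pf), (pt, pt)), mode="constant")
    out = np.zeros_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(padded[i:i + kf, j:j + kt] * kernel)
    return out

def apply_tf_mask(spectrogram, kernel):
    """Predict a soft T-F mask from the noisy input and apply it element-wise.

    Because the sigmoid keeps every mask value in (0, 1), the masked
    output is a per-bin attenuation of the noisy spectrogram -- the
    masking-like operation performed by the enhancement block.
    """
    mask = sigmoid(conv2d_same(spectrogram, kernel))
    return mask * spectrogram, mask

# Toy example: a 40-bin x 100-frame non-negative "spectrogram".
rng = np.random.default_rng(0)
noisy = np.abs(rng.standard_normal((40, 100)))
kernel = rng.standard_normal((3, 3)) * 0.1   # hypothetical conv weights
enhanced, mask = apply_tf_mask(noisy, kernel)

assert mask.min() > 0.0 and mask.max() < 1.0      # soft, bounded mask
assert np.all(enhanced <= noisy + 1e-12)          # masking only attenuates
```

In the full architecture, the masked spectrogram would then feed the LSTM classification block, and since the mask is an explicit intermediate tensor, it can be plotted to inspect which T-F points drive the command decision.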


Author information

Corresponding author

Correspondence to Meet Soni.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Soni, M., Sheikh, I., Kopparapu, S.K. (2019). Label-Driven Time-Frequency Masking for Robust Speech Command Recognition. In: Ekštein, K. (eds) Text, Speech, and Dialogue. TSD 2019. Lecture Notes in Computer Science, vol 11697. Springer, Cham. https://doi.org/10.1007/978-3-030-27947-9_29

  • DOI: https://doi.org/10.1007/978-3-030-27947-9_29

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-27946-2

  • Online ISBN: 978-3-030-27947-9

  • eBook Packages: Computer Science (R0)
