Label-Driven Time-Frequency Masking for Robust Speech Command Recognition

  • Conference paper
  • Text, Speech, and Dialogue (TSD 2019)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11697)

Abstract

Speech-enhancement-driven robust Automatic Speech Recognition (ASR) systems typically require a parallel corpus of noisy and clean speech utterances for training. Moreover, many studies have reported that such front-ends, though they improve speech quality, do not translate into improved recognition performance. On the other hand, multi-condition training of ASR systems offers little visualization or interpretability of how these systems achieve robustness. In this paper, we propose a novel neural architecture with unified enhancement and sequence classification blocks, trained in an end-to-end manner using only noisy speech, without any knowledge of clean speech. The enhancement block is a fully convolutional network designed to perform a Time-Frequency (T-F) masking-like operation, followed by an LSTM sequence classification block. The T-F masking formulation enables visualization of the learned mask and helps us analyse which T-F points are important for the classification of a speech command. Experiments performed on the Google Speech Commands dataset show that the proposed network achieves better results than the model without an enhancement front-end.
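The masking-like operation at the core of the enhancement block can be sketched as follows. This is a minimal NumPy illustration, not the authors' trained network: a single random 3×3 convolution filter stands in for the fully convolutional front-end, and the sigmoid bounds the predicted mask in (0, 1) so that applying it can only attenuate the noisy spectrogram, never amplify it.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv2d_same(x, kernel):
    """Naive 'same'-padded 2-D correlation over a (freq, time) map."""
    kf, kt = kernel.shape
    pf, pt = kf // 2, kt // 2
    padded = np.pad(x, ((pf, pf), (pt, pt)), mode="constant")
    out = np.zeros_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(padded[i:i + kf, j:j + kt] * kernel)
    return out

def apply_tf_mask(spectrogram, kernel):
    """Predict a soft T-F mask from the noisy input and apply it element-wise.

    Because the sigmoid keeps every mask value in (0, 1), the masked
    output is a per-bin attenuation of the noisy spectrogram -- the
    masking-like operation performed by the enhancement block.
    """
    mask = sigmoid(conv2d_same(spectrogram, kernel))
    return mask * spectrogram, mask

# Toy example: a 40-bin x 100-frame non-negative "spectrogram".
rng = np.random.default_rng(0)
noisy = np.abs(rng.standard_normal((40, 100)))
kernel = rng.standard_normal((3, 3)) * 0.1   # hypothetical conv weights
enhanced, mask = apply_tf_mask(noisy, kernel)

assert mask.min() > 0.0 and mask.max() < 1.0      # soft, bounded mask
assert np.all(enhanced <= noisy + 1e-12)          # masking only attenuates
```

In the full architecture, the masked spectrogram would then feed the LSTM classification block, and since the mask is an explicit intermediate tensor, it can be plotted to inspect which T-F points drive the command decision.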


Author information

Corresponding author

Correspondence to Meet Soni.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Soni, M., Sheikh, I., Kopparapu, S.K. (2019). Label-Driven Time-Frequency Masking for Robust Speech Command Recognition. In: Ekštein, K. (eds) Text, Speech, and Dialogue. TSD 2019. Lecture Notes in Computer Science, vol 11697. Springer, Cham. https://doi.org/10.1007/978-3-030-27947-9_29

  • DOI: https://doi.org/10.1007/978-3-030-27947-9_29

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-27946-2

  • Online ISBN: 978-3-030-27947-9

  • eBook Packages: Computer Science (R0)
