
Learning Efficient Representations for Keyword Spotting with Triplet Loss

  • Conference paper
  • In: Speech and Computer (SPECOM 2021)

Abstract

In the past few years, triplet loss-based metric embeddings have become a de facto standard for several important computer vision problems, most notably person re-identification. In speech recognition, by contrast, the metric embeddings produced by a triplet loss are rarely used, even for classification problems. We fill this gap, showing that combining two representation learning techniques, a triplet loss-based embedding and a variant of kNN classification in place of a cross-entropy loss, significantly (by 26% to 38%) improves the classification accuracy of convolutional networks on the LibriSpeech-derived LibriWords datasets. To do so, we propose a novel triplet mining approach based on phonetic similarity. We also improve on the current best published SOTA for the Google Speech Commands dataset: by about 34% for V1 10+2-class classification, achieving 98.55% accuracy; by about 20% for V2 10+2-class classification, achieving 98.37% accuracy; and by over 50% for V2 35-class classification, achieving 97.0% accuracy. Code is available at https://github.com/roman-vygon/triplet_loss_kws.
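To make the recipe concrete, below is a minimal PyTorch sketch of the two-stage approach the abstract describes: an encoder trained with a triplet margin loss, then kNN classification in the learned embedding space in place of a softmax/cross-entropy head. This is not the authors' implementation; the encoder architecture, tensor shapes, and hyperparameters are illustrative assumptions, and random tensors stand in for real batches.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Toy 1-D conv encoder: log-mel spectrogram -> L2-normalised embedding."""
    def __init__(self, n_mels: int = 64, emb_dim: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # average over the time axis
        )
        self.fc = nn.Linear(128, emb_dim)

    def forward(self, x):  # x: (batch, n_mels, time)
        z = self.fc(self.conv(x).squeeze(-1))
        return F.normalize(z, dim=-1)  # unit-norm embeddings

encoder = Encoder()
triplet_loss = nn.TripletMarginLoss(margin=0.5)  # margin is an assumption
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-3)

# One training step. In the paper, negatives are mined by phonetic
# similarity (hard negatives *sound* like the anchor word); here random
# tensors stand in for a mined (anchor, positive, negative) batch.
anchor, positive, negative = (torch.randn(32, 64, 101) for _ in range(3))
loss = triplet_loss(encoder(anchor), encoder(positive), encoder(negative))
optimizer.zero_grad()
loss.backward()
optimizer.step()

# Inference: a kNN vote over the embedded training set replaces the
# usual softmax/cross-entropy classification head.
with torch.no_grad():
    train_emb = encoder(torch.randn(500, 64, 101))   # embedded train set
    train_lbl = torch.randint(0, 12, (500,))         # 10 + 2 class labels
    test_emb = encoder(torch.randn(8, 64, 101))
    dists = torch.cdist(test_emb, train_emb)             # pairwise L2
    nn_idx = dists.topk(k=5, largest=False).indices      # 5 nearest
    preds = torch.mode(train_lbl[nn_idx], dim=1).values  # majority vote
```

The brute-force torch.cdist search is used only to keep the sketch self-contained; at LibriWords scale the nearest-neighbour lookup would typically be delegated to a GPU similarity-search library such as FAISS [48].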


References

  1. Tang, R., Lin, J.: Deep residual learning for small-footprint keyword spotting. In: International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5484–5488 (2018)

  2. Zhang, Y., Suda, N., Lai, L., Chandra, V.: Hello Edge: keyword spotting on microcontrollers (2017)

  3. de Andrade, D., Sabato, L., Viana, M., Bernkopf, C.: A neural attention model for speech command recognition (2018)

  4. Teacher, C., Kellett, Y., Focht, L.: Experimental, limited vocabulary, speech recognizer. IEEE Trans. Audio Electroacoust. 15(3), 127–130 (1967)

  5. Rohlicek, J.R., Russell, W., Roukos, S., Gish, H.: Continuous hidden Markov modeling for speaker-independent word spotting. In: International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 627–630 (1989)

  6. Szöke, I., Schwarz, P., Matějka, P., Burget, L., Karafiát, M., Černocký, J.: Phoneme based acoustics keyword spotting in informal continuous speech. In: Matoušek, V., Mautner, P., Pavelka, T. (eds.) TSD 2005. LNCS (LNAI), vol. 3658, pp. 302–309. Springer, Heidelberg (2005). https://doi.org/10.1007/11551874_39

  7. Zhang, S., Shuang, Z., Shi, Q., Qin, Y.: Improved mandarin keyword spotting using confusion garbage model. In: 20th International Conference on Pattern Recognition (ICPR), pp. 3700–3703 (2010)

  8. Greibus, M., Telksnys, L.: Speech keyword spotting with rule based segmentation. In: Skersys, T., Butleris, R., Butkiene, R. (eds.) ICIST 2013. CCIS, vol. 403, pp. 186–197. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-41947-8_17

  9. Principi, E., Squartini, S., Bonfigli, R., Ferroni, G., Piazza, F.: An integrated system for voice command recognition and emergency detection based on audio signals. Expert Syst. Appl. 42(13), 5668–5683 (2015). https://doi.org/10.1016/j.eswa.2015.02.036

  10. Chen, G., Parada, C., Heigold, G.: Small-footprint keyword spotting using deep neural networks. In: International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4087–4091 (2014)

  11. Sainath, T.N., Parada, C.: Convolutional neural networks for small-footprint keyword spotting. In: Sixteenth Annual Conference of the International Speech Communication Association (2015)

  12. Arik, S.O., et al.: Convolutional recurrent neural networks for small-footprint keyword spotting (2017)

  13. Sun, M., et al.: Max-pooling loss training of long short-term memory networks for small-footprint keyword spotting. In: Spoken Language Technology Workshop (SLT), pp. 474–480 (2016)

  14. He, Y., Prabhavalkar, R., Rao, K., Li, W., Bakhtin, A., McGraw, I.: Streaming small-footprint keyword spotting using sequence-to-sequence models. In: Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 474–481 (2017)

  15. Lei, J., et al.: Low-power audio keyword spotting using Tsetlin machines. J. Low Power Electron. Appl. 11(2), 18 (2021)

  16. Warden, P.: Speech commands: a public dataset for single-word speech recognition (2018)

  17. Jansson, P.: Single-word speech recognition with convolutional neural networks on raw waveforms. Degree Thesis, Information Technology, Arcada University, Finland (2018)

  18. Majumdar, S., Ginsburg, B.: MatchboxNet: 1D time-channel separable convolutional neural network architecture for speech commands recognition. In: Proceedings of Interspeech, pp. 3356–3360 (2020). https://doi.org/10.21437/Interspeech.2020-1058

  19. Mordido, G., Van Keirsbilck, M., Keller, A.: Compressing 1D time-channel separable convolutions using sparse random ternary matrices (2021)

  20. Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. In: Proceedings of Interspeech, pp. 2277–2281 (2020)

  21. Wei, Y., Gong, Z., Yang, S., Ye, K., Wen, Y.: EdgeCRNN: an edge-computing oriented model of acoustic feature enhancement for keyword spotting. J. Ambient Intell. Humaniz. Comput. 1–11 (2021). https://doi.org/10.1007/s12652-021-03022-1

  22. Tang, R., et al.: Howl: a deployed, open-source wake word detection system. In: Proceedings of the Second Workshop for NLP Open Source Software (NLP-OSS), pp. 61–65 (2020)

  23. Hermans, A., Beyer, L., Leibe, B.: In defense of the triplet loss for person re-identification (2017)

  24. Wang, J., et al.: Learning fine-grained image similarity with deep ranking. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1386–1393 (2014)

  25. Chechik, G., Sharma, V., Shalit, U., Bengio, S.: Large scale online learning of image similarity through ranking. J. Mach. Learn. Res. 11, 1109–1135 (2010)

  26. Schroff, F., Kalenichenko, D., Philbin, J.: FaceNet: a unified embedding for face recognition and clustering. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 815–823 (2015)

  27. Huang, J., Li, Y., Tao, J., Lian, Z.: Speech emotion recognition from variable-length inputs with triplet loss function. In: Proceedings of Interspeech, pp. 3673–3677 (2018)

  28. Ren, M., Nie, W., Liu, A., Su, Y.: Multi-modal correlated network for emotion recognition in speech. Vis. Inform. 3(3), 150–155 (2019)

  29. Kumar, P., Jain, S., Raman, B., Roy, P.P., Iwamura, M.: End-to-end triplet loss based emotion embedding system for speech emotion recognition. In: 25th International Conference on Pattern Recognition (ICPR), pp. 8766–8773 (2021)

  30. Harvill, J., AbdelWahab, M., Lotfian, R., Busso, C.: Retrieving speech samples with similar emotional content using a triplet loss function. In: International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, United Kingdom, pp. 7400–7404 (2019)

  31. Bredin, H.: TristouNet: triplet loss for speaker turns embedding. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5430–5434 (2017)

  32. Song, H., Willi, M., Thiagarajan, J.J., Berisha, V., Spanias, A.: Triplet network with attention for speaker diarization. In: Proceedings of Interspeech, pp. 3608–3612 (2018)

  33. Zhang, C., Koishida, K.: End-to-end text-independent speaker verification with triplet loss on short utterances. In: Proceedings of Interspeech, pp. 1487–1491 (2017)

  34. Li, C., et al.: Deep Speaker: an end-to-end neural speaker embedding system (2017)

  35. Turpault, N., Serizel, R., Vincent, E.: Semi-supervised triplet loss based learning of ambient audio embeddings. In: ICASSP 2019, Brighton, United Kingdom (2019)

  36. Sacchi, N., Nanchen, A., Jaggi, M., Cerňak, M.: Open-vocabulary keyword spotting with audio and text embeddings. In: Proceedings of Interspeech, pp. 3362–3366 (2019)

  37. Shor, J., et al.: Towards learning a universal non-semantic representation of speech. In: Proceedings of Interspeech, pp. 140–144 (2020)

  38. Yuan, Y., Lv, Z., Huang, S., Xie, L.: Verifying deep keyword spotting detection with acoustic word embeddings. In: IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 613–620 (2019)

  39. Huh, J., Lee, M., Heo, H., Mun, S., Chung, J.S.: Metric learning for keyword spotting. In: IEEE Spoken Language Technology Workshop (SLT), pp. 133–140 (2021)

  40. Huang, J., Gharbieh, W., Shim, H.S., Kim, E.: Query-by-example keyword spotting system using multi-head attention and SoftTriple loss. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6858–6862 (2021)

  41. Tang, R., Lin, J.: Honk: a PyTorch reimplementation of convolutional neural networks for keyword spotting (2017). http://arxiv.org/abs/1710.06554

  42. Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an ASR corpus based on public domain audio books. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015)

  43. Lugosch, L., Ravanelli, M., Ignoto, P., Tomar, V.S., Bengio, Y.: Speech model pre-training for end-to-end spoken language understanding. In: Proceedings of Interspeech, pp. 814–818 (2019)

  44. McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal Forced Aligner: trainable text-speech alignment using Kaldi. In: Proceedings of Interspeech, pp. 498–502 (2017)

  45. https://zenodo.org/record/2619474. Accessed 2 Jan 2021

  46. Ahmed, A.F., Sherif, M.A., Ngomo, A.C.N.: Do your resources sound similar? On the impact of using phonetic similarity in link discovery. In: Proceedings of the 10th International Conference on Knowledge Capture (K-CAP 2019), pp. 53–60 (2019)

  47. Ginsburg, B., et al.: Stochastic gradient methods with layer-wise adaptive moments for training of deep networks (2019)

  48. Johnson, J., Douze, M., Jégou, H.: Billion-scale similarity search with GPUs. IEEE Trans. Big Data 7(3), 535–547 (2021)


Acknowledgments

The authors are grateful to:

• colleagues at NTR Labs Machine Learning Research group for the discussions and support;

• Prof. Sergey Orlov and Prof. Oleg Zmeev for the computing facilities provided;

• Nikolay Shmyrev for pointing out the works [38, 39].

Author information

Correspondence to Nikolay Mikhaylovskiy.


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Vygon, R., Mikhaylovskiy, N. (2021). Learning Efficient Representations for Keyword Spotting with Triplet Loss. In: Karpov, A., Potapova, R. (eds.) Speech and Computer. SPECOM 2021. Lecture Notes in Computer Science, vol. 12997. Springer, Cham. https://doi.org/10.1007/978-3-030-87802-3_69


  • DOI: https://doi.org/10.1007/978-3-030-87802-3_69

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-87801-6

  • Online ISBN: 978-3-030-87802-3

  • eBook Packages: Computer Science (R0)
