
Learning Efficient Representations for Keyword Spotting with Triplet Loss

  • Conference paper
  • In: Speech and Computer (SPECOM 2021)

Abstract

In the past few years, triplet loss-based metric embeddings have become a de facto standard for several important computer vision problems, most notably person re-identification. In speech recognition, by contrast, the metric embeddings produced by a triplet loss are rarely used, even for classification problems. We fill this gap, showing that combining two representation learning techniques, a triplet loss-based embedding and a variant of kNN classification in place of a cross-entropy loss, significantly (by 26% to 38%) improves the classification accuracy of convolutional networks on the LibriSpeech-derived LibriWords datasets. To do so, we propose a novel triplet mining approach based on phonetic similarity. We also improve on the current best published SOTA for the Google Speech Commands dataset: by about 34% for V1 10+2-class classification, achieving 98.55% accuracy; by about 20% for V2 10+2-class classification, achieving 98.37% accuracy; and by over 50% for V2 35-class classification, achieving 97.0% accuracy. Code is available at https://github.com/roman-vygon/triplet_loss_kws.
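To make the recipe concrete, below is a minimal PyTorch sketch of the two-stage approach the abstract describes: an encoder trained with a triplet margin loss, then kNN classification in the learned embedding space in place of a softmax/cross-entropy head. This is not the authors' implementation; the encoder architecture, tensor shapes, and hyperparameters are illustrative assumptions, and random tensors stand in for real batches.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Toy 1-D conv encoder: log-mel spectrogram -> L2-normalised embedding."""
    def __init__(self, n_mels: int = 64, emb_dim: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # average over the time axis
        )
        self.fc = nn.Linear(128, emb_dim)

    def forward(self, x):  # x: (batch, n_mels, time)
        z = self.fc(self.conv(x).squeeze(-1))
        return F.normalize(z, dim=-1)  # unit-norm embeddings

encoder = Encoder()
triplet_loss = nn.TripletMarginLoss(margin=0.5)  # margin is an assumption
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-3)

# One training step. In the paper, negatives are mined by phonetic
# similarity (hard negatives *sound* like the anchor word); here random
# tensors stand in for a mined (anchor, positive, negative) batch.
anchor, positive, negative = (torch.randn(32, 64, 101) for _ in range(3))
loss = triplet_loss(encoder(anchor), encoder(positive), encoder(negative))
optimizer.zero_grad()
loss.backward()
optimizer.step()

# Inference: a kNN vote over the embedded training set replaces the
# usual softmax/cross-entropy classification head.
with torch.no_grad():
    train_emb = encoder(torch.randn(500, 64, 101))   # embedded train set
    train_lbl = torch.randint(0, 12, (500,))         # 10 + 2 class labels
    test_emb = encoder(torch.randn(8, 64, 101))
    dists = torch.cdist(test_emb, train_emb)             # pairwise L2
    nn_idx = dists.topk(k=5, largest=False).indices      # 5 nearest
    preds = torch.mode(train_lbl[nn_idx], dim=1).values  # majority vote
```

The brute-force torch.cdist search is used only to keep the sketch self-contained; at LibriWords scale the nearest-neighbour lookup would typically be delegated to a GPU similarity-search library such as FAISS [48].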


References

  1. Tang, R., Lin, J.: Deep residual learning for small-footprint keyword spotting. In: International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5484–5488 (2018)

  2. Zhang, Y., Suda, N., Lai, L., Chandra, V.: Hello Edge: keyword spotting on microcontrollers (2017)

  3. de Andrade, D., Sabato, L., Viana, M., Bernkopf, C.: A neural attention model for speech command recognition (2018)

  4. Teacher, C., Kellett, Y., Focht, L.: Experimental, limited vocabulary, speech recognizer. IEEE Trans. Audio Electroacoust. 15(3), 127–130 (1967)

  5. Rohlicek, J.R., Russell, W., Roukos, S., Gish, H.: Continuous hidden Markov modeling for speaker-independent word spotting. In: International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 627–630 (1989)

  6. Szöke, I., Schwarz, P., Matějka, P., Burget, L., Karafiát, M., Černocký, J.: Phoneme based acoustics keyword spotting in informal continuous speech. In: Matoušek, V., Mautner, P., Pavelka, T. (eds.) TSD 2005. LNCS (LNAI), vol. 3658, pp. 302–309. Springer, Heidelberg (2005). https://doi.org/10.1007/11551874_39

  7. Zhang, S., Shuang, Z., Shi, Q., Qin, Y.: Improved mandarin keyword spotting using confusion garbage model. In: 20th International Conference on Pattern Recognition (ICPR), pp. 3700–3703 (2010)

  8. Greibus, M., Telksnys, L.: Speech keyword spotting with rule based segmentation. In: Skersys, T., Butleris, R., Butkiene, R. (eds.) ICIST 2013. CCIS, vol. 403, pp. 186–197. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-41947-8_17

  9. Principi, E., Squartini, S., Bonfigli, R., Ferroni, G., Piazza, F.: An integrated system for voice command recognition and emergency detection based on audio signals. Expert Syst. Appl. 42(13), 5668–5683 (2015). https://doi.org/10.1016/j.eswa.2015.02.036

  10. Chen, G., Parada, C., Heigold, G.: Small-footprint keyword spotting using deep neural networks. In: International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4087–4091 (2014)

  11. Sainath, T.N., Parada, C.: Convolutional neural networks for small-footprint keyword spotting. In: Sixteenth Annual Conference of the International Speech Communication Association (2015)

  12. Arik, S.O., et al.: Convolutional recurrent neural networks for small-footprint keyword spotting (2017)

  13. Sun, M., et al.: Max-pooling loss training of long short-term memory networks for small-footprint keyword spotting. In: Spoken Language Technology Workshop (SLT), pp. 474–480 (2016)

  14. He, Y., Prabhavalkar, R., Rao, K., Li, W., Bakhtin, A., McGraw, I.: Streaming small-footprint keyword spotting using sequence-to-sequence models. In: Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 474–481 (2017)

  15. Lei, J., et al.: Low-power audio keyword spotting using Tsetlin machines. J. Low Power Electron. Appl. 11(2), 18 (2021)

  16. Warden, P.: Speech commands: a public dataset for single-word speech recognition (2018)

  17. Jansson, P.: Single-word speech recognition with convolutional neural networks on raw waveforms. Degree Thesis, Information Technology, Arcada University, Finland (2018)

  18. Majumdar, S., Ginsburg, B.: MatchboxNet: 1D time-channel separable convolutional neural network architecture for speech commands recognition. In: Proceedings of Interspeech, pp. 3356–3360 (2020). https://doi.org/10.21437/Interspeech.2020-1058

  19. Mordido, G., Van Keirsbilck, M., Keller, A.: Compressing 1D time-channel separable convolutions using sparse random ternary matrices (2021)

  20. Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. In: Proceedings of Interspeech, pp. 2277–2281 (2020)

  21. Wei, Y., Gong, Z., Yang, S., Ye, K., Wen, Y.: EdgeCRNN: an edge-computing oriented model of acoustic feature enhancement for keyword spotting. J. Ambient Intell. Humaniz. Comput. 1–11 (2021). https://doi.org/10.1007/s12652-021-03022-1

  22. Tang, R., et al.: Howl: a deployed, open-source wake word detection system. In: Proceedings of the Second Workshop for NLP Open Source Software (NLP-OSS), pp. 61–65 (2020)

  23. Hermans, A., Beyer, L., Leibe, B.: In defense of the triplet loss for person re-identification (2017)

  24. Wang, J., et al.: Learning fine-grained image similarity with deep ranking. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1386–1393 (2014)

  25. Chechik, G., Sharma, V., Shalit, U., Bengio, S.: Large scale online learning of image similarity through ranking. J. Mach. Learn. Res. 11, 1109–1135 (2010)

  26. Schroff, F., Kalenichenko, D., Philbin, J.: FaceNet: a unified embedding for face recognition and clustering. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 815–823 (2015)

  27. Huang, J., Li, Y., Tao, J., Lian, Z.: Speech emotion recognition from variable-length inputs with triplet loss function. In: Proceedings of Interspeech, pp. 3673–3677 (2018)

  28. Ren, M., Nie, W., Liu, A., Su, Y.: Multi-modal correlated network for emotion recognition in speech. Vis. Inform. 3(3), 150–155 (2019)

  29. Kumar, P., Jain, S., Raman, B., Roy, P.P., Iwamura, M.: End-to-end triplet loss based emotion embedding system for speech emotion recognition. In: 25th International Conference on Pattern Recognition (ICPR), pp. 8766–8773 (2021)

  30. Harvill, J., AbdelWahab, M., Lotfian, R., Busso, C.: Retrieving speech samples with similar emotional content using a triplet loss function. In: International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, United Kingdom, pp. 7400–7404 (2019)

  31. Bredin, H.: TristouNet: triplet loss for speaker turns embedding. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5430–5434 (2017)

  32. Song, H., Willi, M., Thiagarajan, J.J., Berisha, V., Spanias, A.: Triplet network with attention for speaker diarization. In: Proceedings of Interspeech, pp. 3608–3612 (2018)

  33. Zhang, C., Koishida, K.: End-to-end text-independent speaker verification with triplet loss on short utterances. In: Proceedings of Interspeech, pp. 1487–1491 (2017)

  34. Li, C., et al.: Deep Speaker: an end-to-end neural speaker embedding system (2017)

  35. Turpault, N., Serizel, R., Vincent, E.: Semi-supervised triplet loss based learning of ambient audio embeddings. In: ICASSP 2019, Brighton, United Kingdom (2019)

  36. Sacchi, N., Nanchen, A., Jaggi, M., Cerňak, M.: Open-vocabulary keyword spotting with audio and text embeddings. In: Proceedings of Interspeech, pp. 3362–3366 (2019)

  37. Shor, J., et al.: Towards learning a universal non-semantic representation of speech. In: Proceedings of Interspeech, pp. 140–144 (2020)

  38. Yuan, Y., Lv, Z., Huang, S., Xie, L.: Verifying deep keyword spotting detection with acoustic word embeddings. In: IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 613–620 (2019)

  39. Huh, J., Lee, M., Heo, H., Mun, S., Chung, J.S.: Metric learning for keyword spotting. In: IEEE Spoken Language Technology Workshop (SLT), pp. 133–140 (2021)

  40. Huang, J., Gharbieh, W., Shim, H.S., Kim, E.: Query-by-example keyword spotting system using multi-head attention and SoftTriple loss. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6858–6862 (2021)

  41. Tang, R., Lin, J.: Honk: a PyTorch reimplementation of convolutional neural networks for keyword spotting (2017). http://arxiv.org/abs/1710.06554

  42. Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an ASR corpus based on public domain audio books. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015)

  43. Lugosch, L., Ravanelli, M., Ignoto, P., Tomar, V.S., Bengio, Y.: Speech model pre-training for end-to-end spoken language understanding. In: Proceedings of Interspeech, pp. 814–818 (2019)

  44. McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal Forced Aligner: trainable text-speech alignment using Kaldi. In: Proceedings of Interspeech, pp. 498–502 (2017)

  45. https://zenodo.org/record/2619474. Accessed 2 Jan 2021

  46. Ahmed, A.F., Sherif, M.A., Ngomo, A.C.N.: Do your resources sound similar? On the impact of using phonetic similarity in link discovery. In: Proceedings of the 10th International Conference on Knowledge Capture (K-CAP 2019), pp. 53–60 (2019)

  47. Ginsburg, B., et al.: Stochastic gradient methods with layer-wise adaptive moments for training of deep networks (2019)

  48. Johnson, J., Douze, M., Jégou, H.: Billion-scale similarity search with GPUs. IEEE Trans. Big Data 7(3), 535–547 (2021)


Acknowledgments

The authors are grateful to:

• colleagues at NTR Labs Machine Learning Research group for the discussions and support;

• Prof. Sergey Orlov and Prof. Oleg Zmeev for the computing facilities provided;

• Nikolay Shmyrev for pointing out the works [38, 39].

Author information

Correspondence to Nikolay Mikhaylovskiy.


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Vygon, R., Mikhaylovskiy, N. (2021). Learning Efficient Representations for Keyword Spotting with Triplet Loss. In: Karpov, A., Potapova, R. (eds.) Speech and Computer. SPECOM 2021. Lecture Notes in Computer Science, vol. 12997. Springer, Cham. https://doi.org/10.1007/978-3-030-87802-3_69


  • DOI: https://doi.org/10.1007/978-3-030-87802-3_69

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-87801-6

  • Online ISBN: 978-3-030-87802-3

  • eBook Packages: Computer Science (R0)
