
STEPs-RL: Speech-Text Entanglement for Phonetically Sound Representation Learning

  • Conference paper

Advances in Knowledge Discovery and Data Mining (PAKDD 2021)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12714)


Abstract

In this paper, we present a novel multi-modal deep neural network architecture that uses speech-text entanglement to learn phonetically sound spoken-word representations. STEPs-RL is trained in a supervised manner to predict the phonetic sequence of a target spoken word from the speech and text of its contextual spoken words, so that the model encodes meaningful latent representations of the target word. Unlike existing work, we use text alongside speech for auditory representation learning, capturing semantic and syntactic information in addition to acoustic and temporal information. The latent representations produced by our model not only predicted the target phonetic sequences with an accuracy of 89.47% but also achieved results competitive with the textual word representation models Word2Vec and FastText (trained on textual transcripts) when evaluated on four widely used word-similarity benchmark datasets. Investigation of the generated vector space further demonstrated the ability of the proposed model to capture the phonetic structure of spoken words. To the best of our knowledge, no existing work uses speech-text entanglement for learning spoken-word representations, which makes this work the first of its kind.
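The abstract describes a supervised objective: fuse a target word's acoustics with the text of its context words, then decode the fused vector into a phone sequence. The PyTorch sketch below is a hypothetical, minimal reconstruction of that idea, not the authors' implementation; the layer sizes, the 44-phone inventory, the fixed decoder length, and the name StepsRLSketch are all illustrative assumptions.

import torch
import torch.nn as nn

class StepsRLSketch(nn.Module):
    """Illustrative sketch of the STEPs-RL idea (not the paper's code):
    entangle speech and text into one latent vector, then decode phones."""

    def __init__(self, n_mels=40, text_dim=300, hidden=256, n_phones=44):
        super().__init__()
        # Bi-LSTM over the target word's acoustic frames (e.g., log-mels).
        self.speech_enc = nn.LSTM(n_mels, hidden, batch_first=True,
                                  bidirectional=True)
        # Projection for text embeddings of the contextual words.
        self.text_proj = nn.Linear(text_dim, 2 * hidden)
        # Fuse ("entangle") the two modalities into one latent vector.
        self.fuse = nn.Linear(4 * hidden, hidden)
        # Decode the latent vector into a fixed-length phone sequence.
        self.dec = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_phones)

    def forward(self, mels, text_emb, max_phones=12):
        _, (h, _) = self.speech_enc(mels)             # h: (2, B, hidden)
        speech_vec = torch.cat([h[0], h[1]], dim=-1)  # (B, 2*hidden)
        latent = torch.tanh(self.fuse(
            torch.cat([speech_vec, self.text_proj(text_emb)], dim=-1)))
        # Feed the latent vector to the decoder at every output step.
        steps = latent.unsqueeze(1).repeat(1, max_phones, 1)
        dec_out, _ = self.dec(steps)
        return self.out(dec_out), latent  # per-step phone logits, word vector

model = StepsRLSketch()
mels = torch.randn(8, 120, 40)   # 8 spoken words, 120 frames of 40 mels each
text = torch.randn(8, 300)       # e.g., averaged context-word embeddings
logits, word_vecs = model(mels, text)
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 44),
                             torch.randint(0, 44, (8 * 12,)))  # dummy targets

Evaluating the learned word vectors on word-similarity benchmarks then reduces to correlating cosine similarities of word-pair vectors with human ratings (typically Spearman's rho), the same protocol used to score Word2Vec and FastText.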

P. Mishra—Independent Researcher.




Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Mishra, P. (2021). STEPs-RL: Speech-Text Entanglement for Phonetically Sound Representation Learning. In: Karlapalem, K., et al. (eds.) Advances in Knowledge Discovery and Data Mining. PAKDD 2021. Lecture Notes in Computer Science, vol 12714. Springer, Cham. https://doi.org/10.1007/978-3-030-75768-7_5


  • DOI: https://doi.org/10.1007/978-3-030-75768-7_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-75767-0

  • Online ISBN: 978-3-030-75768-7

  • eBook Packages: Computer Science (R0)
