Abstract
In this paper, we present STEPs-RL, a novel multi-modal deep neural network architecture that entangles speech and text to learn phonetically sound spoken-word representations. STEPs-RL is trained in a supervised manner to predict the phonetic sequence of a target spoken word from the speech and text of its contextual spoken words, so that the model encodes meaningful latent representations. Unlike existing work, we use text alongside speech for auditory representation learning, capturing semantic and syntactic information in addition to acoustic and temporal information. The latent representations produced by our model not only predicted the target phonetic sequences with an accuracy of 89.47% but also achieved results competitive with the textual word representation models Word2Vec and FastText (trained on textual transcripts) when evaluated on four widely used word-similarity benchmark datasets. In addition, an investigation of the generated vector space demonstrated the model's ability to capture the phonetic structure of spoken words. To the best of our knowledge, no existing work uses speech-text entanglement for learning spoken-word representations, which makes this work the first of its kind.
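To make the overall idea concrete, the following is a minimal, purely illustrative sketch of multi-modal fusion for phoneme prediction: a pooled acoustic encoding and a text encoding are combined ("entangled") into one latent vector, from which a softmax over a phoneme inventory is computed. All dimensions, weights, the multiplicative fusion, and the toy phoneme set are hypothetical assumptions for illustration; this is not the authors' STEPs-RL architecture, which uses contextual spoken words and is trained end to end.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy phoneme inventory and dimensions (hypothetical; the paper predicts
# full phonetic sequences, e.g. over a TIMIT-style phone set).
PHONEMES = ["sil", "k", "ae", "t"]
SPEECH_DIM, TEXT_DIM, HIDDEN = 13, 8, 16   # e.g. 13 MFCCs per speech frame

def encode_word(speech_frames, text_embedding, W_s, W_t):
    """Fuse acoustic and textual views of one spoken word.

    speech_frames  : (T, SPEECH_DIM) acoustic features (e.g. MFCC frames)
    text_embedding : (TEXT_DIM,) embedding of the word's transcript
    Returns a single latent vector combining both modalities.
    """
    acoustic = np.tanh(speech_frames @ W_s).mean(axis=0)  # pooled speech encoding
    textual = np.tanh(W_t @ text_embedding)               # text encoding
    return acoustic * textual  # multiplicative "entanglement" (illustrative choice)

def predict_phoneme(latent, W_out):
    """Softmax distribution over the phoneme inventory."""
    logits = W_out @ latent
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

# Random (untrained) weights, just to show the shapes flowing through.
W_s = rng.normal(scale=0.1, size=(SPEECH_DIM, HIDDEN))
W_t = rng.normal(scale=0.1, size=(HIDDEN, TEXT_DIM))
W_out = rng.normal(scale=0.1, size=(len(PHONEMES), HIDDEN))

frames = rng.normal(size=(40, SPEECH_DIM))  # ~40 frames of synthetic features
text = rng.normal(size=(TEXT_DIM,))
latent = encode_word(frames, text, W_s, W_t)
probs = predict_phoneme(latent, W_out)
print(latent.shape, probs.shape)
```

In a trained model, the latent vector would be the spoken-word representation evaluated on the word-similarity benchmarks, and the softmax head would be applied step by step to produce the target phonetic sequence.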
P. Mishra—Independent Researcher.
References
Baker, S., Reichart, R., Korhonen, A.: An unsupervised model for instance level subcategorization acquisition. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 278–289 (2014)
Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. 5, 135–146 (2017)
Bourlard, H.A., Morgan, N.: Connectionist Speech Recognition: A Hybrid Approach, vol. 247. Springer, New York (2012). https://doi.org/10.1007/978-1-4615-3210-1
Chen, Y., Huang, S., Lee, H., Wang, Y., Shen, C.: Audio Word2Vec: sequence-to-sequence autoencoding for unsupervised learning of audio segmentation and representation. IEEE/ACM Trans. Audio Speech Lang. Process. 27(9), 1481–1493 (2019)
Chorowski, J., Weiss, R.J., Bengio, S., van den Oord, A.: Unsupervised speech representation learning using WaveNet autoencoders. IEEE/ACM Trans. Audio Speech Lang. Proc. 27(12), 2041–2053 (2019). https://doi.org/10.1109/TASLP.2019.2938863
Cui, J., et al.: Multilingual representations for low resource speech recognition and keyword search. In: 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 259–266 (2015)
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol. 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186. Association for Computational Linguistics, June 2019. https://doi.org/10.18653/v1/N19-1423, https://www.aclweb.org/anthology/N19-1423
Garofolo, J.S., Lamel, L.F., Fisher, W.M., Fiscus, J.G., Pallett, D.S., Dahlgren, N.L.: DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM (TIMIT) (1993)
Glass, J.: Challenges for spoken dialogue systems. In: Proceedings of the 1999 IEEE ASRU Workshop, vol. 696 (1999)
Graves, A., Jaitly, N., Mohamed, A.-r.: Hybrid speech recognition with deep bidirectional LSTM. In: 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, pp. 273–278. IEEE (2013)
Graves, A., Mohamed, A.-r., Hinton, G.: Speech recognition with deep recurrent neural networks. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 6645–6649. IEEE (2013)
Halawi, G., Dror, G., Gabrilovich, E., Koren, Y.: Large-scale learning of word relatedness with constraints. In: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1406–1414 (2012)
Herff, C., Schultz, T.: Automatic speech recognition from neural signals: a focused review. Front. Neurosci. 10, 429 (2016)
Hill, F., Reichart, R., Korhonen, A.: SimLex-999: evaluating semantic models with (genuine) similarity estimation. Comput. Linguist. 41(4), 665–695 (2015)
Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997). https://doi.org/10.1162/neco.1997.9.8.1735
Hori, T., Cho, J., Watanabe, S.: End-to-end speech recognition with word-based RNN language models. In: 2018 IEEE Spoken Language Technology Workshop (SLT), pp. 389–396. IEEE (2018)
Hotelling, H.: Analysis of a complex of statistical variables into principal components. J. Educ. Psychol. 24(6), 417 (1933)
Hsu, W.N., Zhang, Y., Glass, J.: Unsupervised learning of disentangled and interpretable representations from sequential data. In: Advances in Neural Information Processing Systems (2017)
Kamper, H.: Truly unsupervised acoustic word embeddings using weak top-down constraints in encoder-decoder models. In: ICASSP 2019–2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6535–6539 (2019)
Khurana, S., et al.: A convolutional deep Markov model for unsupervised speech representation learning (2020)
Li, X., Wu, X.: Modeling speaker variability using long short-term memory networks for speech recognition. In: INTERSPEECH (2015)
Ling, S., Salazar, J., Liu, Y., Kirchhoff, K.: BERTphone: phonetically-aware encoder representations for utterance-level speaker and language recognition. In: Proceedings of Odyssey 2020 The Speaker and Language Recognition Workshop, pp. 9–16 (2020). https://doi.org/10.21437/Odyssey.2020-2
Liu, A.T., Li, S.W., Lee, H.-y.: TERA: self-supervised learning of transformer encoder representation for speech (2020)
Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, pp. 3111–3119 (2013)
Moriya, Y., Jones, G.J.: LSTM language model adaptation with images and titles for multimedia automatic speech recognition. In: 2018 IEEE Spoken Language Technology Workshop (SLT), pp. 219–226. IEEE (2018)
van den Oord, A., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. CoRR abs/1807.03748 (2018). http://arxiv.org/abs/1807.03748
Polka, L., Orena, A.J., Sundara, M., Worrall, J.: Segmenting words from fluent speech during infancy–challenges and opportunities in a bilingual context. Dev. Sci. 20(1), e12419 (2017)
Purwins, H., Li, B., Virtanen, T., Schlüter, J., Chang, S., Sainath, T.: Deep learning for audio signal processing. IEEE J. Sel. Top. Sign. Process. 13(2), 206–219 (2019)
Ravanelli, M., Brakel, P., Omologo, M., Bengio, Y.: Light gated recurrent units for speech recognition. IEEE Trans. Emerg. Top. Comput. Intell. 2(2), 92–102 (2018)
Schneider, S., Baevski, A., Collobert, R., Auli, M.: wav2vec: unsupervised pre-training for speech recognition. In: Proceedings of Interspeech 2019, pp. 3465–3469 (2019). https://doi.org/10.21437/Interspeech.2019-1873
Schuster, M., Paliwal, K.K.: Bidirectional recurrent neural networks. IEEE Trans. Sign. Process. 45(11), 2673–2681 (1997)
Tan, T., et al.: Speaker-aware training of LSTM-RNNs for acoustic modelling. In: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5280–5284 (2016)
Tang, Z., Shi, Y., Wang, D., Feng, Y., Zhang, S.: Memory visualization for gated recurrent neural networks in speech recognition. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2736–2740. IEEE (2017)
Vincent, E., Barker, J., Watanabe, S., Le Roux, J., Nesta, F., Matassoni, M.: The second ‘CHiME’ speech separation and recognition challenge: an overview of challenge systems and outcomes. In: 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, pp. 162–167. IEEE (2013)
Yang, D., Powers, D.M.: Verb similarity on the taxonomy of WordNet. Masaryk University (2006)
Zeyer, A., Doetsch, P., Voigtlaender, P., Schlüter, R., Ney, H.: A comprehensive study of deep bidirectional LSTM RNNs for acoustic modeling in speech recognition. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2462–2466. IEEE (2017)
Copyright information
© 2021 Springer Nature Switzerland AG
Cite this paper
Mishra, P. (2021). STEPs-RL: Speech-Text Entanglement for Phonetically Sound Representation Learning. In: Karlapalem, K., et al. Advances in Knowledge Discovery and Data Mining. PAKDD 2021. Lecture Notes in Computer Science(), vol 12714. Springer, Cham. https://doi.org/10.1007/978-3-030-75768-7_5
Print ISBN: 978-3-030-75767-0
Online ISBN: 978-3-030-75768-7