
End-to-End Speech Recognition

  • Uday Kamath
  • John Liu
  • James Whitaker
Chapter

Abstract

In Chap. 8, we aimed to create an ASR system by dividing the fundamental equation
$$ W^* = \operatorname*{argmax}_{W \in V^*} P(W|X) $$
into an acoustic model, a lexicon model, and a language model using Bayes’ theorem. This approach relies heavily on conditional independence assumptions and on separate optimization procedures for the different models.
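As a rough sketch of the decomposition referred to here (standard HMM-era notation; the phone-sequence variable $Q$ is introduced only for illustration and does not appear in the abstract), Bayes’ theorem lets the posterior be rewritten as
$$ W^* = \operatorname*{argmax}_{W \in V^*} \frac{P(X|W)\,P(W)}{P(X)} = \operatorname*{argmax}_{W \in V^*} P(X|W)\,P(W), \qquad P(X|W) \approx \sum_{Q} P(X|Q)\,P(Q|W), $$
where $P(X|Q)$ plays the role of the acoustic model over phone (or state) sequences $Q$, $P(Q|W)$ the lexicon (pronunciation) model, and $P(W)$ the language model, each typically trained and optimized separately.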


Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Uday Kamath (1)
  • John Liu (2)
  • James Whitaker (1)
  1. Digital Reasoning Systems Inc., McLean, USA
  2. Intelluron Corporation, Nashville, USA
