
End-to-End Architectures for Speech Recognition

A chapter in New Era for Robust Speech Recognition

Abstract

Automatic speech recognition (ASR) has traditionally integrated ideas from many different domains, such as signal processing (mel-frequency cepstral coefficient features), natural language processing (n-gram language models), and statistics (hidden Markov models). Because of this "compartmentalization," the components of an ASR system are largely optimized individually and in isolation, which can negatively affect overall performance. End-to-end approaches address this problem by optimizing the components jointly, under a single criterion. This can also reduce the need for human experts to design and build speech recognition systems by painstakingly finding the best combination of several resources, a process that remains something of a "black art." This chapter first discusses several recent deep-learning-based approaches to end-to-end speech recognition. Next, we present the EESEN framework, which combines connectionist-temporal-classification (CTC)-based acoustic models with a weighted-finite-state-transducer (WFST) decoding setup. EESEN achieves state-of-the-art word error rates while drastically simplifying the ASR pipeline.
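To make the CTC output convention concrete, here is a minimal sketch of greedy CTC decoding: take the per-frame argmax labels, collapse consecutive repeats, and then drop the blank symbol. This is only an illustration of the CTC label mapping, not EESEN's actual decoder (which searches a composed WFST); the blank index and the example label values are assumptions made for the sketch.

```python
BLANK = 0  # assumed index of the CTC blank symbol


def ctc_greedy_decode(frame_labels):
    """Map a frame-wise label sequence to an output sequence:
    collapse consecutive repeats, then remove blanks."""
    out = []
    prev = None
    for label in frame_labels:
        # A label is emitted only when it differs from the previous
        # frame's label and is not the blank symbol.
        if label != prev and label != BLANK:
            out.append(label)
        prev = label
    return out


# Hypothetical frame-wise argmax labels; a blank between the two 15s
# is what allows a genuinely repeated output symbol to survive.
print(ctc_greedy_decode([3, 3, 0, 1, 1, 0, 20]))   # -> [3, 1, 20]
print(ctc_greedy_decode([2, 15, 0, 15, 11]))       # -> [2, 15, 15, 11]
```

Note that this greedy procedure uses no lexicon or language model; the WFST-based decoding described in the chapter exists precisely to bring those constraints back into the search.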


Notes

  1. Code and the latest recipes and results can be found at https://github.com/srvk/eesen.

  2. http://www.speech.cs.cmu.edu/cgi-bin/cmudict.

  3. Note that the HMM/LSTM system employed a unidirectional LSTM, while the EESEN system used a bidirectional LSTM. Readers should take this discrepancy into account when comparing the results.



Author information

Correspondence to Florian Metze.


Copyright information

© 2017 Springer International Publishing AG

About this chapter

Cite this chapter

Miao, Y., Metze, F. (2017). End-to-End Architectures for Speech Recognition. In: Watanabe, S., Delcroix, M., Metze, F., Hershey, J. (eds.) New Era for Robust Speech Recognition. Springer, Cham. https://doi.org/10.1007/978-3-319-64680-0_13


  • DOI: https://doi.org/10.1007/978-3-319-64680-0_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-64679-4

  • Online ISBN: 978-3-319-64680-0

  • eBook Packages: Computer Science
