Automatic Speech Recognition

Lu, Xugang; Li, Sheng; Fujimoto, Masakiyo

doi:10.1007/978-981-15-0595-9_2

Part of the book series: SpringerBriefs in Computer Science (BRIEFSCOMPUTER)

Abstract

The main task of automatic speech recognition (ASR) is to convert speech signals into text transcriptions. It is one of the most important research fields in natural language processing (NLP). After more than half a century of effort, the word error rate (WER), a metric of transcription performance, has been reduced significantly. In recent years in particular, increased computational power, large quantities of collected data, and efficient neural learning algorithms have allowed deep learning to raise the performance of ASR systems to a practical level. However, many issues still need further investigation before these systems can be adapted to a wide range of applications. In this chapter, we introduce the mainstream ASR frameworks and their pipeline, focusing on the two dominant ones: the hidden Markov model (HMM) with Gaussian mixture model (GMM)-based framework, which dominated the field in its early decades, and the deep learning-based framework, which dominates current practice. We also introduce noise robustness, one of the most important challenges for ASR in real applications.
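
The WER named in the abstract is conventionally computed as the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the recognizer output, divided by the number of reference words. The abstract does not spell this out, so the following is only a minimal sketch under that common definition; the function name `word_error_rate` and the whitespace tokenization are illustrative assumptions, not taken from the chapter.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: words are separated by whitespace and compared exactly
    (no case folding or punctuation normalization).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (initially none) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over 6 reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions are counted against the reference length, this quantity can exceed 1 when the hypothesis is much longer than the reference.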

Notes

1. http://voicetra.nict.go.jp/en/
2. SPINE: Speech in noisy environments, http://www.speech.sri.com/projects/spine/
3. Aurora speech recognition experimental framework, http://aurora.hsnr.de/index-2.html
4. Computational hearing in multisource environments (CHiME) challenge, http://spandh.dcs.shef.ac.uk/projects/chime/
5. Reverberant voice enhancement and recognition benchmark (REVERB) challenge, https://reverb2014.dereverberation.com/

Author information

Authors and Affiliations

Advanced Speech Technology Laboratory, Advanced Speech Translation Research and Development Promotion Center, National Institute of Information and Communications Technology, Kyoto, Japan
Xugang Lu, Sheng Li & Masakiyo Fujimoto

Corresponding author

Correspondence to Xugang Lu.

Editor information

Editors and Affiliations

Advanced Speech Translation Research and Development Promotion Center, National Institute of Information and Communications Technology, Kyoto, Japan
Yutaka Kidawara
Advanced Translation Technology Laboratory, Advanced Speech Translation Research and Development Promotion Center, National Institute of Information and Communications Technology, Kyoto, Japan
Eiichiro Sumita
Advanced Speech Technology Laboratory, Advanced Speech Translation Research and Development Promotion Center, National Institute of Information and Communications Technology, Kyoto, Japan
Hisashi Kawai

About this chapter

Cite this chapter

Lu, X., Li, S., Fujimoto, M. (2020). Automatic Speech Recognition. In: Kidawara, Y., Sumita, E., Kawai, H. (eds) Speech-to-Speech Translation. SpringerBriefs in Computer Science. Springer, Singapore. https://doi.org/10.1007/978-981-15-0595-9_2

DOI: https://doi.org/10.1007/978-981-15-0595-9_2
Published: 23 November 2019
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-0594-2
Online ISBN: 978-981-15-0595-9
eBook Packages: Computer Science (R0)
