Automatic Speech Recognition

Kamath, Uday; Liu, John; Whitaker, James

doi:10.1007/978-3-030-14596-5_8

Abstract

Automatic speech recognition (ASR) has grown tremendously in recent years, with deep learning playing a key role. Simply put, ASR is the task of converting spoken language into computer-readable text (Fig. 8.1). It has quickly become ubiquitous as a useful way to interact with technology, significantly bridging the gap in human–computer interaction and making it more natural.

Notes

1. https://voice.mozilla.org/en/data.
2. http://sox.sourceforge.net/.
3. http://www.speech.cs.cmu.edu/tools/lextool.html.
4. If there are additional labels like speaker and gender, these can also be used in the process. Common Voice does not have these labels, so each utterance is treated independently.
5. Note: It is possible to add specific words to the lexicon by editing the lexicon-iv.txt file.

Author information

Authors and Affiliations

Digital Reasoning Systems Inc., McLean, VA, USA
Uday Kamath & James Whitaker
Intelluron Corporation, Nashville, TN, USA
John Liu

About this chapter

Cite this chapter

Kamath, U., Liu, J., Whitaker, J. (2019). Automatic Speech Recognition. In: Deep Learning for NLP and Speech Recognition. Springer, Cham. https://doi.org/10.1007/978-3-030-14596-5_8

DOI: https://doi.org/10.1007/978-3-030-14596-5_8
Published: 11 June 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-14595-8
Online ISBN: 978-3-030-14596-5
eBook Packages: Computer Science, Computer Science (R0)
