Abstract
Automatic speech recognition (ASR) has advanced rapidly in recent years, with deep learning playing a key role. Simply put, ASR is the task of converting spoken language into computer-readable text (Fig. 8.1). It has quickly become a ubiquitous way to interact with technology, significantly narrowing the gap in human–computer interaction and making it more natural.
Notes
- 4. If additional labels such as speaker and gender are available, they can also be used in the process. Common Voice does not have these labels, so each utterance is treated independently.
- 5. Note: It is possible to add specific words to the lexicon by editing the lexicon-iv.txt file.
Copyright information
© 2019 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Kamath, U., Liu, J., Whitaker, J. (2019). Automatic Speech Recognition. In: Deep Learning for NLP and Speech Recognition . Springer, Cham. https://doi.org/10.1007/978-3-030-14596-5_8
DOI: https://doi.org/10.1007/978-3-030-14596-5_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-14595-8
Online ISBN: 978-3-030-14596-5
eBook Packages: Computer Science, Computer Science (R0)