Abstract
Almost all present-day continuous speech recognition (CSR) systems are based on hidden Markov models (HMMs). Although the fundamentals of HMM-based CSR have been understood for several decades, there has been steady progress in refining the technology both in terms of reducing the impact of the inherent assumptions, and in adapting the models for specific applications and environments. The aim of this chapter is to review the core architecture of an HMM-based CSR system and then outline the major areas of refinement incorporated into modern systems.
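Although the chapter itself reviews the full CSR architecture, the core decoding step it builds on can be illustrated in isolation. The sketch below is not from the chapter; it is a minimal Viterbi decoder for a discrete-observation HMM (the continuous-density, GMM-based models used in real CSR systems replace the discrete emission table with Gaussian mixture likelihoods). All variable names are illustrative.

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most-likely state sequence for a discrete-observation HMM.

    obs : sequence of observation symbol indices
    pi  : (N,) initial state probabilities
    A   : (N, N) transition probabilities, A[i, j] = P(state j | state i)
    B   : (N, M) emission probabilities, B[i, k] = P(symbol k | state i)
    """
    N, T = len(pi), len(obs)
    # Work in the log domain to avoid underflow on long utterances.
    log_delta = np.log(pi) + np.log(B[:, obs[0]])
    backptr = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        # scores[i, j]: best log-prob of reaching state j at time t via i.
        scores = log_delta[:, None] + np.log(A)
        backptr[t] = scores.argmax(axis=0)
        log_delta = scores.max(axis=0) + np.log(B[:, obs[t]])
    # Trace the best path backwards from the most likely final state.
    path = [int(log_delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return list(reversed(path)), float(log_delta.max())

path, logp = viterbi(
    [0, 1, 2],
    np.array([0.6, 0.4]),
    np.array([[0.7, 0.3], [0.4, 0.6]]),
    np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]),
)
print(path, np.exp(logp))  # [0, 0, 1] with probability 0.01512
```

In an LVCSR decoder the same dynamic-programming recursion runs over a network of context-dependent phone HMMs composed with the pronunciation lexicon and language model, typically implemented via token passing with beam pruning rather than a dense trellis.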
Abbreviations
- ASR: automatic speech recognition
- CAT: cluster adaptive training
- CDF: cumulative distribution function
- CMLLR: constrained MLLR
- CSR: continuous speech recognition
- DARPA: Defense Advanced Research Projects Agency
- EM: expectation maximization
- EMLLT: extended maximum likelihood linear transform
- FFT: fast Fourier transform
- GMM: Gaussian mixture model
- HLDA: heteroscedastic LDA
- HMM: hidden Markov model
- HTK: hidden Markov model toolkit
- LDA: linear discriminant analysis
- LVCSR: large-vocabulary continuous speech recognition
- MAP: maximum a posteriori
- MCE: minimum classification error
- MFCC: mel-frequency cepstral coefficient
- ML: maximum likelihood
- MLLR: maximum-likelihood linear regression
- MMI: maximum mutual information
- MPE: minimum phone error
- NSA: National Security Agency
- PLP: perceptual linear prediction
- RPA: raw phone accuracy
- SAT: speaker adaptive training
- SI: speaker independent
- SPAM: subspace-constrained precision and means
- STC: semi-tied covariance
- VTLN: vocal-tract-length normalization
- WER: word error rate
© 2008 Springer-Verlag Berlin Heidelberg
Cite this chapter
Young, S. (2008). HMMs and Related Speech Recognition Technologies. In: Benesty, J., Sondhi, M.M., Huang, Y.A. (eds) Springer Handbook of Speech Processing. Springer Handbooks. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-49127-9_27
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-49125-5
Online ISBN: 978-3-540-49127-9