Abstract
Almost all present-day continuous speech recognition (CSR) systems are based on hidden Markov models (HMMs). Although the fundamentals of HMM-based CSR have been understood for several decades, there has been steady progress in refining the technology both in terms of reducing the impact of the inherent assumptions, and in adapting the models for specific applications and environments. The aim of this chapter is to review the core architecture of an HMM-based CSR system and then outline the major areas of refinement incorporated into modern systems.
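Although the chapter itself reviews the full CSR architecture, the core decoding step it builds on can be illustrated in isolation. The sketch below is not from the chapter; it is a minimal Viterbi decoder for a discrete-observation HMM (the continuous-density, GMM-based models used in real CSR systems replace the discrete emission table with Gaussian mixture likelihoods). All variable names are illustrative.

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most-likely state sequence for a discrete-observation HMM.

    obs : sequence of observation symbol indices
    pi  : (N,) initial state probabilities
    A   : (N, N) transition probabilities, A[i, j] = P(state j | state i)
    B   : (N, M) emission probabilities, B[i, k] = P(symbol k | state i)
    """
    N, T = len(pi), len(obs)
    # Work in the log domain to avoid underflow on long utterances.
    log_delta = np.log(pi) + np.log(B[:, obs[0]])
    backptr = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        # scores[i, j]: best log-prob of reaching state j at time t via i.
        scores = log_delta[:, None] + np.log(A)
        backptr[t] = scores.argmax(axis=0)
        log_delta = scores.max(axis=0) + np.log(B[:, obs[t]])
    # Trace the best path backwards from the most likely final state.
    path = [int(log_delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return list(reversed(path)), float(log_delta.max())

path, logp = viterbi(
    [0, 1, 2],
    np.array([0.6, 0.4]),
    np.array([[0.7, 0.3], [0.4, 0.6]]),
    np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]),
)
print(path, np.exp(logp))  # [0, 0, 1] with probability 0.01512
```

In an LVCSR decoder the same dynamic-programming recursion runs over a network of context-dependent phone HMMs composed with the pronunciation lexicon and language model, typically implemented via token passing with beam pruning rather than a dense trellis.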
Abbreviations
- ASR: automatic speech recognition
- CAT: cluster adaptive training
- CDF: cumulative distribution function
- CMLLR: constrained MLLR
- CSR: continuous speech recognition
- DARPA: Defense Advanced Research Projects Agency
- EM: expectation maximization
- EMLLT: extended maximum likelihood linear transform
- FFT: fast Fourier transform
- GMM: Gaussian mixture model
- HLDA: heteroscedastic LDA
- HMM: hidden Markov model
- HTK: hidden Markov model toolkit
- LDA: linear discriminant analysis
- LVCSR: large-vocabulary continuous speech recognition
- MAP: maximum a posteriori
- MCE: minimum classification error
- MFCC: mel-frequency cepstral coefficient
- ML: maximum likelihood
- MLLR: maximum-likelihood linear regression
- MMI: maximum mutual information
- MPE: minimum phone error
- NSA: National Security Agency
- PLP: perceptual linear prediction
- RPA: raw phone accuracy
- SAT: speaker adaptive training
- SI: speaker independent
- SPAM: subspace-constrained precision and means
- STC: semi-tied covariance
- VTLN: vocal-tract-length normalization
- WER: word error rate
© 2008 Springer-Verlag Berlin Heidelberg
Cite this chapter
Young, S. (2008). HMMs and Related Speech Recognition Technologies. In: Benesty, J., Sondhi, M.M., Huang, Y.A. (eds) Springer Handbook of Speech Processing. Springer Handbooks. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-49127-9_27
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-49125-5
Online ISBN: 978-3-540-49127-9