HMMs and Related Speech Recognition Technologies

Chapter

Part of the book series: Springer Handbooks (SHB)

Abstract

Almost all present-day continuous speech recognition (CSR) systems are based on hidden Markov models (HMMs). Although the fundamentals of HMM-based CSR have been understood for several decades, there has been steady progress in refining the technology both in terms of reducing the impact of the inherent assumptions, and in adapting the models for specific applications and environments. The aim of this chapter is to review the core architecture of an HMM-based CSR system and then outline the major areas of refinement incorporated into modern systems.
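
As background for the abstract's reference to the core HMM-based CSR architecture, the sketch below shows the decoding idea at its centre: Viterbi search for the most likely HMM state sequence given per-frame log output probabilities. It is an illustrative sketch only, not code from the chapter; the function name, array layout, and NumPy usage are assumptions made for this example.

```python
import numpy as np

def viterbi(log_pi, log_A, log_B):
    """Most likely HMM state sequence for one observation sequence.

    log_pi : (N,)   log initial-state probabilities
    log_A  : (N, N) log transition probabilities, log_A[i, j] = log P(state j | state i)
    log_B  : (T, N) log output probabilities of each state for each of T frames
    Returns (best state path of length T, its total log probability).
    """
    T, N = log_B.shape
    delta = np.full((T, N), -np.inf)    # best partial-path log scores
    psi = np.zeros((T, N), dtype=int)   # back-pointers to the best predecessor state

    delta[0] = log_pi + log_B[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A        # (from-state, to-state) candidate scores
        psi[t] = np.argmax(scores, axis=0)
        delta[t] = scores[psi[t], np.arange(N)] + log_B[t]

    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):                     # trace back through the back-pointers
        path.append(int(psi[t, path[-1]]))
    path.reverse()
    return path, float(np.max(delta[-1]))

# Tiny self-contained demo with random (hypothetical) model parameters.
rng = np.random.default_rng(0)
N, T = 3, 5
log_pi = np.log(np.full(N, 1.0 / N))
log_A = np.log(rng.dirichlet(np.ones(N), size=N))    # row-stochastic transition matrix
log_B = np.log(rng.dirichlet(np.ones(N), size=T))    # stand-in frame/state likelihoods
print(viterbi(log_pi, log_A, log_B))
```

In a full LVCSR system the states come from context-dependent phone HMMs compiled together with the pronunciation lexicon and language model, and the search is pruned rather than exhaustive; those refinements are what the chapter goes on to describe.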

Abbreviations

ASR: automatic speech recognition
CAT: cluster adaptive training
CDF: cumulative distribution function
CMLLR: constrained MLLR
CSR: continuous speech recognition
DARPA: Defense Advanced Research Projects Agency
EM: expectation maximization
EMLLT: extended maximum likelihood linear transform
FFT: fast Fourier transform
GMM: Gaussian mixture model
HLDA: heteroscedastic LDA
HMM: hidden Markov model
HTK: hidden Markov model toolkit
LDA: linear discriminant analysis
LVCSR: large vocabulary continuous speech recognition
MAP: maximum a posteriori
MCE: minimum classification error
MFCC: mel-frequency cepstral coefficient
ML: maximum likelihood
MLLR: maximum likelihood linear regression
MMI: maximum mutual information
MPE: minimum phone error
NSA: National Security Agency
PLP: perceptual linear prediction
RPA: raw phone accuracy
SAT: speaker adaptive training
SI: speaker independent
SPAM: subspace-constrained precision and means
STC: semi-tied covariance
VTLN: vocal-tract-length normalization
WER: word error rate

References

  1. J.K. Baker: The DRAGON system - an overview, IEEE Trans. Acoust. Speech Signal Process. 23(1), 24-29 (1975)

  2. F. Jelinek: Continuous speech recognition by statistical methods, Proc. IEEE 64(4), 532-556 (1976)

  3. B.T. Lowerre: The Harpy Speech Recognition System, Ph.D. Dissertation (Carnegie Mellon, Pittsburgh 1976)

  4. L.R. Rabiner, B.-H. Juang, S.E. Levinson, M.M. Sondhi: Recognition of isolated digits using HMMs with continuous mixture densities, AT&T Tech. J. 64(6), 1211-1233 (1985)

  5. L.R. Rabiner: A tutorial on hidden Markov models and selected applications in speech recognition, Proc. IEEE 77(2), 257-286 (1989)

  6. P.J. Price, W. Fisher, J. Bernstein, D.S. Pallett: The DARPA 1000-word resource management database for continuous speech recognition, Proc. IEEE ICASSP 1, 651-654 (1988)

  7. S.J. Young, L.L. Chase: Speech recognition evaluation: A review of the US CSR and LVCSR programmes, Comput. Speech Lang. 12(4), 263-279 (1998)

  8. D.S. Pallett, J.G. Fiscus, J. Garofolo, A. Martin, M. Przybocki: 1998 Broadcast News Benchmark Test Results: English and Non-English Word Error Rate Performance Measures, Tech. Rep. (National Institute of Standards and Technology, Gaithersburg 1998)

  9. J.J. Godfrey, E.C. Holliman, J. McDaniel: SWITCHBOARD: Telephone speech corpus for research and development, Proc. IEEE ICASSP 1, 517-520 (1992)

  10. G. Evermann, H.Y. Chan, M.J.F. Gales, T. Hain, X. Liu, D. Mrva, L. Wang, P. Woodland: Development of the 2003 CU-HTK Conversational Telephone Speech Transcription System, Proc. IEEE ICASSP (2004)

  11. H. Soltau, B. Kingsbury, L. Mangu, D. Povey, G. Saon, G. Zweig: The IBM 2004 conversational telephony system for rich transcription, Proc. IEEE ICASSP (2005)

  12. S.J. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. Liu, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, P. Woodland: The HTK Book Version 3.4 (Cambridge University, Cambridge 2006), http://htk.eng.cam.ac.uk

  13. S.J. Young: Large vocabulary continuous speech recognition, IEEE Signal Process. Mag. 13(5), 45-57 (1996)

  14. S.B. Davis, P. Mermelstein: Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences, IEEE Trans. Acoust. Speech Signal Process. 28(4), 357-366 (1980)

  15. H. Hermansky: Perceptual linear predictive (PLP) analysis of speech, J. Acoust. Soc. Am. 87(4), 1738-1752 (1990)

  16. A.P. Dempster, N.M. Laird, D.B. Rubin: Maximum likelihood from incomplete data via the EM algorithm, J. Roy. Stat. Soc. B 39, 1-38 (1977)

  17. S.J. Young, J.J. Odell, P.C. Woodland: Tree-based state tying for high accuracy acoustic modelling, Proc. Human Language Technology Workshop (Morgan Kaufmann, San Francisco 1994) pp. 307-312

  18. X. Luo, F. Jelinek: Probabilistic classification of HMM states for large vocabulary, Proc. IEEE ICASSP (1999) pp. 2044-2047

  19. S.M. Katz: Estimation of probabilities from sparse data for the language model component of a speech recogniser, IEEE Trans. Acoust. Speech Signal Process. 35(3), 400-401 (1987)

  20. H. Ney, U. Essen, R. Kneser: On structuring probabilistic dependences in stochastic language modelling, Comput. Speech Lang. 8(1), 1-38 (1994)

  21. S.F. Chen, J. Goodman: An empirical study of smoothing techniques for language modelling, Comput. Speech Lang. 13, 359-394 (1999)

  22. P.F. Brown, V.J. Della Pietra, P.V. de Souza, J.C. Lai, R.L. Mercer: Class-based n-gram models of natural language, Comput. Linguist. 18(4), 467-479 (1992)

  23. R. Kneser, H. Ney: Improved clustering techniques for class-based statistical language modelling, Proc. Eurospeech 93, 973-976 (1993)

  24. S. Martin, J. Liermann, H. Ney: Algorithms for bigram and trigram word clustering, Proc. Eurospeech 2, 1253-1256 (1995)

  25. S.J. Young, N.H. Russell, J.H.S. Thornton: Token Passing: A Simple Conceptual Model for Connected Speech Recognition Systems, Tech. Rep. CUED/F-INFENG/TR38 (Cambridge University, Cambridge 1989)

  26. K. Demuynck, J. Duchateau, D. van Compernolle: A static lexicon network representation for cross-word context dependent phones, Proc. Eurospeech 97, 143-146 (1997)

  27. S.J. Young: Generating multiple solutions from connected word DP recognition algorithms, Proc. IOA Autumn Conf. 6, 351-354 (1984)

  28. H. Thompson: Best-first enumeration of paths through a lattice - an active chart parsing solution, Comput. Speech Lang. 4(3), 263-274 (1990)

  29. F. Richardson, M. Ostendorf, J.R. Rohlicek: Lattice-based search strategies for large vocabulary recognition, Proc. IEEE ICASSP 1, 576-579 (1995)

  30. L. Mangu, E. Brill, A. Stolcke: Finding consensus among words: Lattice-based word error minimisation, Comput. Speech Lang. 14(4), 373-400 (2000)

  31. G. Evermann, P.C. Woodland: Posterior probability decoding confidence estimation and system combination, Proc. Speech Transcription Workshop (2000)

  32. G. Evermann, P.C. Woodland: Large vocabulary decoding and confidence estimation using word posterior probabilities, Proc. IEEE ICASSP (2000) pp. 1655-1658

  33. V. Goel, S. Kumar, B. Byrne: Segmental minimum Bayes-risk ASR voting strategies, Proc. ICSLP (2000)

  34. J. Fiscus: A post-processing system to yield reduced word error rates: Recogniser output voting error reduction (ROVER), Proc. IEEE ASRU Workshop (1997) pp. 347-352

  35. D. Hakkani-Tur, F. Bechet, G. Riccardi, G. Tur: Beyond ASR 1-best: Using word confusion networks in spoken language understanding, Comput. Speech Lang. 20(4), 495-514 (2006)

  36. J.J. Odell, V. Valtchev, P.C. Woodland, S.J. Young: A one-pass decoder design for large vocabulary recognition, Proc. Human Language Technology Workshop (Morgan Kaufmann, San Francisco 1994) pp. 405-410

  37. X. Aubert, H. Ney: Large vocabulary continuous speech recognition using word graphs, Proc. IEEE ICASSP 1, 49-52 (1995)

  38. M. Mohri, F. Pereira, M. Riley: Weighted finite state transducers in speech recognition, Comput. Speech Lang. 16(1), 69-88 (2002)

  39. F. Jelinek: A fast sequential decoding algorithm using a stack, IBM J. Res. Dev. 13, 675-685 (1969)

  40. D.B. Paul: Algorithms for an optimal A* search and linearizing the search in the stack decoder, Proc. IEEE ICASSP 91, 693-696 (1991)

  41. A. Nadas: A decision theoretic formulation of a training problem in speech recognition and a comparison of training by unconditional versus conditional maximum likelihood, IEEE Trans. Acoust. Speech Signal Process. 31(4), 814-817 (1983)

  42. B.-H. Juang, S.E. Levinson, M.M. Sondhi: Maximum likelihood estimation for multivariate mixture observations of Markov chains, IEEE Trans. Inform. Theory 32(2), 307-309 (1986)

  43. L.R. Bahl, P.F. Brown, P.V. de Souza, R.L. Mercer: Maximum mutual information estimation of hidden Markov model parameters for speech recognition, Proc. IEEE ICASSP 86, 49-52 (1986)

  44. V. Valtchev, J.J. Odell, P.C. Woodland, S.J. Young: MMIE training of large vocabulary recognition systems, Speech Commun. 22, 303-314 (1997)

  45. R. Schlüter, B. Müller, F. Wessel, H. Ney: Interdependence of language models and discriminative training, Proc. IEEE ASRU Workshop (1999) pp. 119-122

  46. P. Woodland, D. Povey: Large scale discriminative training of hidden Markov models for speech recognition, Comput. Speech Lang. 16, 25-47 (2002)

  47. W. Chou, C.H. Lee, B.-H. Juang: Minimum error rate training based on N-best string models, Proc. IEEE ICASSP 93, 652-655 (1993)

  48. D. Povey, P. Woodland: Minimum phone error and I-smoothing for improved discriminative training, Proc. IEEE ICASSP 2002, I-105-I-108 (2002)

  49. P.S. Gopalakrishnan, D. Kanevsky, A. Nadas, D. Nahamoo, M.A. Picheny: Decoder selection based on cross-entropies, Proc. IEEE ICASSP 1, 20-23 (1988)

  50. P.C. Woodland, D. Povey: Large scale discriminative training for speech recognition, ISCA ITRW Automatic Speech Recognition: Challenges for the Millenium (2000) pp. 7-16

  51. M.J.F. Gales: Semi-tied covariance matrices for hidden Markov models, IEEE Trans. Speech Audio Process. 7(3), 272-281 (1999)

  52. A.V. Rosti, M. Gales: Factor analysed hidden Markov models for speech recognition, Comput. Speech Lang. 18(2), 181-200 (2004)

  53. S. Axelrod, R. Gopinath, P. Olsen: Modeling with a subspace constraint on inverse covariance matrices, Proc. ICSLP (2002)

  54. P. Olsen, R. Gopinath: Modeling inverse covariance matrices by basis expansion, Proc. ICSLP (2002)

  55. N. Kumar, A.G. Andreou: Heteroscedastic discriminant analysis and reduced rank HMMs for improved speech recognition, Speech Commun. 26, 283-297 (1998)

  56. M.J.F. Gales: Maximum likelihood multiple subspace projections for hidden Markov models, IEEE Trans. Speech Audio Process. 10(2), 37-47 (2002)

  57. T. Hain, P.C. Woodland, T.R. Niesler, E.W.D. Whittaker: The 1998 HTK system for transcription of conversational telephone speech, Proc. IEEE ICASSP 99, 57-60 (1999)

  58. G. Saon, A. Dharanipragada, D. Povey: Feature space Gaussianization, Proc. IEEE ICASSP (2004)

  59. S.S. Chen, R. Gopinath: Gaussianization, Proc. Neural Information Processing Systems (MIT Press, 2000)

  60. M.J.F. Gales, B. Jia, X. Liu, K.C. Sim, P. Woodland, K. Yu: Development of the CUHTK 2004 RT04 Mandarin conversational telephone speech transcription system, Proc. IEEE ICASSP (2005)

  61. L. Lee, R.C. Rose: Speaker normalisation using efficient frequency warping procedures, Proc. IEEE ICASSP (1996)

  62. J. McDonough, W. Byrne, X. Luo: Speaker normalisation with all pass transforms, Proc. ICSLP, Vol. 98 (1998)

  63. D.Y. Kim, S. Umesh, M.J.F. Gales, T. Hain, P. Woodland: Using VTLN for broadcast news transcription, Proc. ICSLP (2004)

  64. J.-L. Gauvain, C.-H. Lee: Maximum a posteriori estimation of multivariate Gaussian mixture observations of Markov chains, IEEE Trans. Speech Audio Process. 2(2), 291-298 (1994)

  65. S.M. Ahadi, P.C. Woodland: Combined Bayesian and predictive techniques for rapid speaker adaptation of continuous density hidden Markov models, Comput. Speech Lang. 11(3), 187-206 (1997)

  66. K. Shinoda, C.H. Lee: Structural MAP speaker adaptation using hierarchical priors, Proc. ASRU, Vol. 97 (1997)

  67. C.J. Leggetter, P.C. Woodland: Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models, Comput. Speech Lang. 9(2), 171-185 (1995)

  68. M.J.F. Gales: Maximum likelihood linear transformations for HMM-based speech recognition, Comput. Speech Lang. 12, 75-98 (1998)

  69. F. Wallhoff, D. Willett, G. Rigoll: Frame-discriminative and confidence-driven adaptation for LVCSR, Proc. IEEE ICASSP (2000) pp. 1835-1838

  70. L. Wang, P. Woodland: Discriminative adaptive training using the MPE criterion, Proc. ASRU (2003)

  71. S. Tsakalidis, V. Doumpiotis, W.J. Byrne: Discriminative linear transforms for feature normalisation and speaker adaptation in HMM estimation, IEEE Trans. Speech Audio Process. 13(3), 367-376 (2005)

  72. P. Woodland, D. Pye, M.J.F. Gales: Iterative unsupervised adaptation using maximum likelihood linear regression, Proc. ICSLP (1996) pp. 1133-1136

  73. M. Padmanabhan, G. Saon, G. Zweig: Lattice-based unsupervised MLLR for speaker adaptation, Proc. ITRW ASR2000: ASR Challenges for the New Millenium (2000) pp. 128-132

  74. T.J. Hazen, J. Glass: A comparison of novel techniques for instantaneous speaker adaptation, Proc. Eurospeech 97, 2047-2050 (1997)

  75. R. Kuhn, L. Nguyen, J.-C. Junqua, L. Goldwasser, N. Niedzielski, S. Finke, K. Field, M. Contolini: Eigenvoices for speaker adaptation, Proc. ICSLP (1998)

  76. M.J.F. Gales: Cluster adaptive training of hidden Markov models, IEEE Trans. Speech Audio Process. 8, 417-428 (2000)

  77. K. Yu, M.J.F. Gales: Discriminative cluster adaptive training, IEEE Trans. Speech Audio Process. 14, 1694-1703 (2006)

  78. T. Anastasakos, J. McDonough, R. Schwartz, J. Makhoul: A compact model for speaker adaptive training, Proc. ICSLP (1996)

  79. R. Sinha, M.J.F. Gales, D.Y. Kim, X. Liu, K.C. Sim, P.C. Woodland: The CU-HTK Mandarin broadcast news transcription system, Proc. IEEE ICASSP (2006)

Author information

Corresponding author

Correspondence to Prof. Steve Young.

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Young, S. (2008). HMMs and Related Speech Recognition Technologies. In: Benesty, J., Sondhi, M.M., Huang, Y.A. (eds) Springer Handbook of Speech Processing. Springer Handbooks. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-49127-9_27

  • DOI: https://doi.org/10.1007/978-3-540-49127-9_27

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-49125-5

  • Online ISBN: 978-3-540-49127-9

  • eBook Packages: Engineering, Engineering (R0)
