Deep Neural Network-Hidden Markov Model Hybrid Systems

Automatic Speech Recognition

Part of the book series: Signals and Communication Technology ((SCT))

Abstract

In this chapter, we describe one of the several possible ways of exploiting deep neural networks (DNNs) in automatic speech recognition systems—the deep neural network-hidden Markov model (DNN-HMM) hybrid system. The DNN-HMM hybrid system takes advantage of DNN’s strong representation learning power and HMM’s sequential modeling ability, and outperforms conventional Gaussian mixture model (GMM)-HMM systems significantly on many large vocabulary continuous speech recognition tasks. We describe the architecture and the training procedure of the DNN-HMM hybrid system and point out the key components of such systems by comparing a range of system setups.

Notes

  1. For the desired segmental model, this duration model is very rough.

  2. The independence assumption made in the HMM is one of the reasons why language model weighting is needed. If the number of feature frames is doubled by extracting a frame every 5 ms instead of every 10 ms, the acoustic model score (a sum of per-frame log-likelihoods) is roughly doubled, so the language model weight also needs to be doubled.

  3. Unfair comparisons were made in several papers that compare the hybrid DNN/HMM system with the KL-HMM system. The conclusions of these papers are thus questionable.
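The argument in note 2 can be checked with a small numeric sketch. It assumes the usual decoding score, the sum of per-frame acoustic log-likelihoods plus a weighted language model log-probability, and it idealizes a 5 ms frame shift as every 10 ms frame appearing twice with an identical score. The function names and all numbers below are hypothetical and chosen only for illustration; they are not taken from the chapter.

```python
def total_score(frame_loglikes, lm_logprob, lm_weight):
    """Decoding score under frame independence: summed per-frame acoustic
    log-likelihoods plus the weighted language model log-probability."""
    return sum(frame_loglikes) + lm_weight * lm_logprob


def best_hypothesis(hyps, lm_weight):
    """Return the name of the highest-scoring hypothesis."""
    return max(hyps, key=lambda name: total_score(*hyps[name], lm_weight))


def double_frame_rate(hyps):
    """Idealized 5 ms shift: each 10 ms frame is repeated once with the same
    log-likelihood (an assumption; real adjacent frames are only similar)."""
    return {
        name: ([ll for ll in frames for _ in range(2)], lm_logprob)
        for name, (frames, lm_logprob) in hyps.items()
    }


# Hypothetical toy hypotheses: (per-frame acoustic log-likelihoods, LM log-prob).
hyps_10ms = {
    "A": ([-2.0, -2.0, -2.0, -2.0], -6.0),  # better acoustics, worse LM score
    "B": ([-3.0, -3.0, -3.0, -3.0], -1.0),  # worse acoustics, better LM score
}
hyps_5ms = double_frame_rate(hyps_10ms)

print(best_hypothesis(hyps_10ms, lm_weight=1.0))  # 10 ms frames, weight 1
print(best_hypothesis(hyps_5ms, lm_weight=1.0))   # 5 ms frames, weight unchanged
print(best_hypothesis(hyps_5ms, lm_weight=2.0))   # 5 ms frames, weight doubled
```

With these numbers the preferred hypothesis changes once the frame rate is doubled and the weight is left alone, because the acoustic term has doubled while the language model term has not. Doubling the weight scales every total by exactly two and restores the original ranking.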

Author information

Correspondence to Dong Yu.

Copyright information

© 2015 Springer-Verlag London

About this chapter

Cite this chapter

Yu, D., Deng, L. (2015). Deep Neural Network-Hidden Markov Model Hybrid Systems. In: Automatic Speech Recognition. Signals and Communication Technology. Springer, London. https://doi.org/10.1007/978-1-4471-5779-3_6

  • DOI: https://doi.org/10.1007/978-1-4471-5779-3_6

  • Publisher Name: Springer, London

  • Print ISBN: 978-1-4471-5778-6

  • Online ISBN: 978-1-4471-5779-3

  • eBook Packages: Engineering, Engineering (R0)
