
Deep Dynamic Models for Learning Hidden Representations of Speech Features

Chapter in: Speech and Audio Processing for Coding, Enhancement and Recognition

Abstract

The deep hierarchical structure, with multiple layers of hidden space, in human speech is intrinsically connected to its dynamic characteristics, which are manifested at all levels of speech production and perception. The desire to capitalize on even a superficial understanding of this deep structure helped ignite the recent surge of interest in deep learning for speech recognition and related applications, and a more thorough understanding of the deep structure of speech dynamics and its computational representations is expected to further advance research in speech technology. In this chapter, we first survey a series of studies on representing speech in a hidden space using dynamic systems and recurrent neural networks, emphasizing the different ways of learning the model parameters and, subsequently, the hidden feature representations of time-varying speech data. We organize this rich set of deep, dynamic speech models into two major categories: (1) top-down, generative models that adopt localist representations of speech classes and features in the hidden space; and (2) bottom-up, discriminative models that adopt distributed representations. Through detailed examination and comparison of these two types of models, we focus on localist versus distributed representations as their respective hallmarks and defining characteristics. Finally, we discuss potential strategies for leveraging the strengths of both localist and distributed representations while overcoming their respective weaknesses, going beyond the blind integration of the two in which a generative model merely pre-trains a discriminative one, as is popular in training deep neural networks.
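The two model families contrasted in the abstract can be sketched in miniature. The code below is a hypothetical illustration, not taken from the chapter: the parameter values (`A`, `w_h`, `w_x`) and the one-dimensional state are chosen purely for exposition. It shows a first-order hidden-dynamic step, in which the state is pulled toward a per-class phonetic target (a localist representation), alongside an RNN hidden-state update, in which the state is a nonlinear, distributed function of the input history.

```python
import math

def hidden_dynamic_step(h, target, A=0.9):
    # Top-down generative view: first-order hidden dynamics.
    # The hidden state decays toward a phonetic target that is
    # specific to one speech class (a localist representation).
    return A * h + (1.0 - A) * target

def rnn_step(h, x, w_h=0.5, w_x=1.0):
    # Bottom-up discriminative view: the hidden state is a
    # nonlinear, distributed function of the input history.
    return math.tanh(w_h * h + w_x * x)

# The generative state converges toward its class target over time...
h = 0.0
for _ in range(50):
    h = hidden_dynamic_step(h, target=1.0)
# h is now close to the target 1.0

# ...while the RNN state summarizes the whole input sequence.
g = 0.0
for x in [0.2, 0.5, -0.1]:
    g = rnn_step(g, x)
```

In the generative model the hidden trajectory is interpretable (it approaches the target of the current speech class), whereas the RNN state carries no per-class meaning on its own; this is the localist-versus-distributed contrast the chapter develops.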



Author information


Correspondence to Li Deng.


Copyright information

© 2015 Springer Science+Business Media New York

Cite this chapter

Deng, L., Togneri, R. (2015). Deep Dynamic Models for Learning Hidden Representations of Speech Features. In: Ogunfunmi, T., Togneri, R., Narasimha, M. (eds) Speech and Audio Processing for Coding, Enhancement and Recognition. Springer, New York, NY. https://doi.org/10.1007/978-1-4939-1456-2_6


  • Print ISBN: 978-1-4939-1455-5

  • Online ISBN: 978-1-4939-1456-2
