Challenges in and Solutions to Deep Learning Network Acoustic Modeling in Speech Recognition Products at Microsoft

Chapter in: New Era for Robust Speech Recognition

Abstract

Deep learning (DL) network acoustic modeling has been widely deployed in real-world speech recognition products and services that benefit millions of users. Beyond the general modeling research pursued in academia, industry faces additional constraints and challenges, such as run-time limits on deployed systems and the need for robustness to variations in acoustic environment, accent, and the lack of manual transcription. For large-scale automatic speech recognition applications, this chapter briefly describes selected developments and investigations at Microsoft that make deep learning networks more effective in a production environment: reducing run-time cost with singular-value-decomposition-based training, improving the accuracy of small-size deep neural networks (DNNs) with teacher-student training, using a small number of parameters for speaker adaptation of acoustic models, improving robustness to the acoustic environment with variable-component DNN modeling, improving robustness to accent and dialect with model adaptation and accent-dependent modeling, introducing time and frequency invariance with time-frequency long short-term memory (LSTM) recurrent neural networks, improving generalization to unseen data with maximum-margin sequence training, using unsupervised data to improve recognition accuracy, and increasing language coverage by reusing speech-training material across languages. These techniques have enabled the deployment of DL acoustic models across Microsoft server and client product lines, including Windows 10 desktops, laptops, and phones, Xbox, and Skype speech-to-speech translation.
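
To make the run-time-cost idea above concrete, the following is a minimal numpy sketch of the general low-rank restructuring behind singular-value-decomposition-based training (Xue, Li, and Gong, Interspeech 2013): a trained hidden-layer weight matrix is replaced by two thinner matrices obtained from a truncated SVD and then fine-tuned. The layer sizes, the rank, and the random stand-in for a trained weight matrix are illustrative assumptions, not values from the chapter.

    import numpy as np

    # Illustrative sizes only (assumed, not from the chapter): a 2048 x 2048
    # hidden layer restructured with a rank-256 truncated SVD.
    m, n, k = 2048, 2048, 256

    rng = np.random.default_rng(0)
    W = rng.standard_normal((m, n)) / np.sqrt(n)   # stand-in for a trained layer

    # Full SVD, then keep the k largest singular values.
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :k] * s[:k]   # m x k, singular values absorbed into the left factor
    B = Vt[:k, :]          # k x n

    # The single m x n layer becomes two thinner layers A and B (A @ B ~ W);
    # here we only check the parameter count and the approximation error.
    print("original params   :", W.size)            # 4,194,304
    print("restructured      :", A.size + B.size)   # 1,048,576 (4x fewer)
    print("relative SVD error:", np.linalg.norm(W - A @ B) / np.linalg.norm(W))

In practice, the restructured network is retrained for a few epochs so that the accuracy lost to the truncation is largely recovered while the reduced run-time cost is kept.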


Author information

Corresponding author

Correspondence to Yifan Gong.


Copyright information

© 2017 Springer International Publishing AG

About this chapter

Cite this chapter

Gong, Y. et al. (2017). Challenges in and Solutions to Deep Learning Network Acoustic Modeling in Speech Recognition Products at Microsoft. In: Watanabe, S., Delcroix, M., Metze, F., Hershey, J. (eds) New Era for Robust Speech Recognition. Springer, Cham. https://doi.org/10.1007/978-3-319-64680-0_19

  • DOI: https://doi.org/10.1007/978-3-319-64680-0_19

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-64679-4

  • Online ISBN: 978-3-319-64680-0

  • eBook Packages: Computer Science, Computer Science (R0)
