Challenges in and Solutions to Deep Learning Network Acoustic Modeling in Speech Recognition Products at Microsoft

Gong, Yifan; Huang, Yan; Kumar, Kshitiz; Li, Jinyu; Liu, Chaojun; Ye, Guoli; Zhang, Shixiong; Zhao, Yong; Zhao, Rui

doi:10.1007/978-3-319-64680-0_19



Abstract

Deep learning (DL) network acoustic modeling has been widely deployed in real-world speech recognition products and services that benefit millions of users. Beyond the general modeling research pursued in academia, industry faces special constraints and challenges, e.g., run-time constraints on system deployment, robustness to variations such as acoustic environment and accent, and a lack of manual transcriptions. For large-scale automatic speech recognition applications, this chapter briefly describes selected developments and investigations at Microsoft to make deep learning networks more effective in a production environment. These include reducing run-time cost with singular-value-decomposition-based training; improving the accuracy of small-size deep neural networks (DNNs) with teacher–student training; using a small number of parameters for speaker adaptation of acoustic models; improving robustness to the acoustic environment with variable-component DNN modeling; improving robustness to accent/dialect with model adaptation and accent-dependent modeling; introducing time and frequency invariance with time–frequency long short-term memory recurrent neural networks; exploring the generalization capability to unseen data with maximum margin sequence training; using unsupervised data to improve speech recognition accuracy; and increasing language capability by reusing speech-training material across languages. The outcome has enabled the deployment of DL acoustic models across Microsoft server and client product lines, including Windows 10 desktops, laptops, and phones, Xbox, and Skype speech-to-speech translation.
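
As a rough illustration of the teacher–student idea mentioned in the abstract, the sketch below computes a per-frame loss that pushes a small student network's output posteriors toward those of a larger teacher. It is a minimal sketch under stated assumptions, not the chapter's implementation: it assumes the common formulation in which the student minimizes the KL divergence from the teacher's posteriors, and the function names (`log_softmax`, `soft_target_cross_entropy`, `teacher_student_kl`) and the example logits are hypothetical.

```python
import math


def log_softmax(logits):
    """Return log posteriors for one frame, computed in a numerically stable way."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def soft_target_cross_entropy(teacher_logits, student_logits):
    """Cross-entropy of the student's posteriors against the teacher's soft targets.

    Assumption: both networks score the same set of output classes
    (e.g. the same tied-state inventory), so the lists align index by index.
    """
    teacher_log_post = log_softmax(teacher_logits)
    student_log_post = log_softmax(student_logits)
    return -sum(math.exp(t) * s for t, s in zip(teacher_log_post, student_log_post))


def teacher_student_kl(teacher_logits, student_logits):
    """KL divergence KL(teacher || student) for one frame.

    This differs from the soft-target cross-entropy only by the teacher's
    entropy, which is constant with respect to the student's parameters.
    """
    teacher_log_post = log_softmax(teacher_logits)
    student_log_post = log_softmax(student_logits)
    return sum(math.exp(t) * (t - s) for t, s in zip(teacher_log_post, student_log_post))


if __name__ == "__main__":
    # Hypothetical logits for a single frame over three output classes.
    teacher = [2.0, 1.0, 0.1]
    student = [1.5, 1.2, 0.3]
    print("soft-target cross-entropy:", soft_target_cross_entropy(teacher, student))
    print("KL(teacher || student):  ", teacher_student_kl(teacher, student))
```

Because the teacher supplies the targets, a loss of this form needs no manual transcription, so under this assumed formulation the student could in principle be trained on untranscribed audio as well.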



Author information

Authors and Affiliations

Microsoft, One Microsoft Way, Redmond, WA, 98052, USA
Yifan Gong, Yan Huang, Kshitiz Kumar, Jinyu Li, Chaojun Liu, Guoli Ye, Shixiong Zhang, Yong Zhao & Rui Zhao


Corresponding author

Correspondence to Yifan Gong.

Editor information

Editors and Affiliations

Mitsubishi Electric Research Laboratories (MERL), Cambridge, Massachusetts, USA
Shinji Watanabe
NTT Communication Science Laboratories, NTT Corporation, Kyoto, Japan
Marc Delcroix
Language Technologies Institute, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
Florian Metze
Mitsubishi Electric Research Laboratories (MERL), Cambridge, Massachusetts, USA
John R. Hershey


About this chapter

Cite this chapter

Gong, Y. et al. (2017). Challenges in and Solutions to Deep Learning Network Acoustic Modeling in Speech Recognition Products at Microsoft. In: Watanabe, S., Delcroix, M., Metze, F., Hershey, J. (eds) New Era for Robust Speech Recognition. Springer, Cham. https://doi.org/10.1007/978-3-319-64680-0_19


DOI: https://doi.org/10.1007/978-3-319-64680-0_19
Published: 26 July 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-64679-4
Online ISBN: 978-3-319-64680-0
eBook Packages: Computer Science (R0)
