Abstract
Deep learning (DL) network acoustic modeling has been widely deployed in real-world speech recognition products and services that benefit millions of users. Beyond the general modeling research pursued in academia, industry faces special constraints and challenges, e.g., run-time constraints on system deployment, robustness to variations in acoustic environment and accent, and the lack of manual transcriptions. For large-scale automatic speech recognition applications, this chapter briefly describes selected developments and investigations at Microsoft to make deep learning networks more effective in a production environment, including reducing run-time cost with singular-value-decomposition-based training, improving the accuracy of small-size deep neural networks (DNNs) with teacher–student training, using a small number of parameters for speaker adaptation of acoustic models, improving robustness to the acoustic environment with variable-component DNN modeling, improving robustness to accent/dialect with model adaptation and accent-dependent modeling, introducing time and frequency invariance with time–frequency long short-term memory recurrent neural networks, exploring the generalization capability to unseen data with maximum margin sequence training, using unsupervised data to improve speech recognition accuracy, and increasing language capability by reusing speech-training material across languages. This work has enabled the deployment of DL acoustic models across Microsoft server and client product lines, including Windows 10 desktop/laptop/phones, XBOX, and Skype speech-to-speech translation.
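As a rough illustration of the teacher–student idea named in the abstract, the sketch below computes a generic soft-target loss: the average per-frame KL divergence between the output posteriors of a large teacher network and those of a small student network. The abstract does not state the chapter's exact training criterion, so this is only one common formulation; the function names and the toy logits are hypothetical and chosen purely for illustration.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def teacher_student_loss(teacher_logits, student_logits):
    """Mean per-frame KL divergence from teacher posteriors to student posteriors."""
    per_frame = [
        kl_divergence(softmax(t), softmax(s))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(per_frame) / len(per_frame)


# Hypothetical toy data: 2 frames, 3 output classes.
teacher = [[4.0, 1.0, 0.5], [0.2, 3.0, 0.1]]
student_far = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]   # uniform posteriors
student_near = [[3.5, 1.0, 0.6], [0.3, 2.8, 0.1]]  # closer to the teacher

print(teacher_student_loss(teacher, student_far))
print(teacher_student_loss(teacher, student_near))
```

A student whose posteriors track the teacher's more closely yields a smaller loss. Because the targets come from the teacher rather than from manual labels, a loss of this kind can in principle be computed on untranscribed audio.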
Copyright information
© 2017 Springer International Publishing AG
About this chapter
Cite this chapter
Gong, Y. et al. (2017). Challenges in and Solutions to Deep Learning Network Acoustic Modeling in Speech Recognition Products at Microsoft. In: Watanabe, S., Delcroix, M., Metze, F., Hershey, J. (eds) New Era for Robust Speech Recognition. Springer, Cham. https://doi.org/10.1007/978-3-319-64680-0_19
DOI: https://doi.org/10.1007/978-3-319-64680-0_19
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-64679-4
Online ISBN: 978-3-319-64680-0
eBook Packages: Computer Science, Computer Science (R0)