Abstract
Mobile devices have limited computing power and memory, so large deep neural network (DNN) based acoustic models are poorly suited to on-device deployment. To alleviate this problem, this paper proposes compressing acoustic models with knowledge transfer, in which a large teacher model transfers its generalized knowledge to a small student model. The student model is trained on a linear interpolation of hard probabilities and soft probabilities: the hard probabilities are generated by a Gaussian mixture model hidden Markov model (GMM-HMM) system, and the soft probabilities are computed by the teacher model (a DNN or RNN). Experiments on the AMI corpus show that a small student model obtains a 2.4% relative WER improvement over the large teacher model at an almost 7.6-fold compression ratio.
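The training criterion described above can be sketched as follows. This is a minimal illustration of the interpolated loss, not the authors' implementation: the interpolation weight `lam`, the temperature, and the NumPy formulation are assumptions chosen for clarity.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax over the class (senone) dimension.
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, hard_labels,
                      lam=0.5, temperature=2.0):
    """Linear interpolation of a hard-label cross-entropy (labels from
    a GMM-HMM system) and a soft cross-entropy against the teacher's
    posterior distribution, as described in the abstract."""
    n = student_logits.shape[0]
    # Hard loss: cross-entropy with the GMM-HMM-derived frame labels.
    hard_probs = softmax(student_logits)
    hard_loss = -np.mean(
        np.log(hard_probs[np.arange(n), hard_labels] + 1e-12))
    # Soft loss: cross-entropy with the teacher's tempered posteriors.
    student_probs = softmax(student_logits, temperature)
    teacher_probs = softmax(teacher_logits, temperature)
    soft_loss = -np.mean(
        (teacher_probs * np.log(student_probs + 1e-12)).sum(axis=-1))
    return lam * hard_loss + (1.0 - lam) * soft_loss
```

With `lam=1.0` the criterion reduces to ordinary cross-entropy training on the hard labels; with `lam=0.0` the student learns only from the teacher's soft probabilities.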
Acknowledgements
This work is supported by the National High-Tech Research and Development Program of China (863 Program) (No. 2015AA016305), the National Natural Science Foundation of China (NSFC) (No. 61425017, No. 61403386).
Copyright information
© 2018 Springer Nature Singapore Pte Ltd.
Cite this paper
Yi, J., Tao, J., Wen, Z., Li, Y., Ni, H. (2018). Acoustic Model Compression with Knowledge Transfer. In: Tao, J., Zheng, T., Bao, C., Wang, D., Li, Y. (eds) Man-Machine Speech Communication. NCMMSC 2017. Communications in Computer and Information Science, vol 807. Springer, Singapore. https://doi.org/10.1007/978-981-10-8111-8_9
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-8110-1
Online ISBN: 978-981-10-8111-8
eBook Packages: Computer Science (R0)