Abstract
Deep neural network (DNN) models have achieved significant results on the Mongolian speech recognition task; however, compared with Chinese, English, and other languages, there is still room for improvement. This paper presents the first application of the Feed-forward Sequential Memory Network (FSMN) to Mongolian speech recognition, modeling long-term dependencies in time series without recurrent feedback. Furthermore, by modeling the speaker in the feature space, we extract i-vector features and combine them with Fbank features as the input to validate their effectiveness in Mongolian ASR tasks. Finally, discriminative training is conducted on the FSMN for the first time, using the maximum mutual information (MMI) and state-level minimum Bayes risk (sMBR) criteria, respectively. The experimental results show that the FSMN outperforms the DNN on Mongolian ASR, and that using i-vector features combined with Fbank features as the FSMN input, together with discriminative training, yields a relative word error rate (WER) reduction of 17.9% over the DNN baseline.
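The two input-side ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the memory-block equations, so the code assumes the scalar, look-back-only formulation from the original FSMN papers by Zhang et al., m_t = sum_i a_i * h_(t-i). The function names `append_ivector` and `fsmn_memory`, the tap coefficients, and all feature values are hypothetical toy choices, and the memory block is applied directly to the input frames where a real network would apply it to hidden-layer activations.

```python
from typing import List, Sequence

Frame = List[float]


def append_ivector(fbank: Sequence[Sequence[float]],
                   ivector: Sequence[float]) -> List[Frame]:
    """Append the same utterance-level i-vector to every Fbank frame."""
    return [list(frame) + list(ivector) for frame in fbank]


def fsmn_memory(hidden: Sequence[Sequence[float]],
                taps: Sequence[float]) -> List[Frame]:
    """Scalar FSMN memory block (assumed form): m_t = sum_i taps[i] * h_{t-i}.

    The sum runs over the current frame (i = 0) and len(taps) - 1 past frames.
    Frames before the start of the utterance are treated as zeros, so no
    recurrent feedback is needed: each output depends only on a fixed window.
    """
    dim = len(hidden[0])
    memory: List[Frame] = []
    for t in range(len(hidden)):
        m_t = [0.0] * dim
        for i, a in enumerate(taps):
            if t - i < 0:
                break  # ran past the start of the utterance
            for d in range(dim):
                m_t[d] += a * hidden[t - i][d]
        memory.append(m_t)
    return memory


# Toy data (hypothetical): three 2-dimensional Fbank frames, 1-dimensional i-vector.
fbank = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
ivector = [0.5]

inputs = append_ivector(fbank, ivector)       # each frame becomes 3-dimensional
memory = fsmn_memory(inputs, taps=[1.0, 0.5])  # current frame plus one past frame
```

In the FSMN papers the tap coefficients are learned jointly with the other network weights; fixing them by hand here only serves to show the fixed-window, feed-forward structure.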
Acknowledgements
This research was supported in part by the National Natural Science Foundation of China (No. 61563040, No. 61773224) and the Natural Science Foundation of Inner Mongolia (No. 2016ZD06).
Copyright information
© 2018 Springer International Publishing AG
About this paper
Cite this paper
Wang, Y., Bao, F., Zhang, H., Gao, G. (2018). Research on Mongolian Speech Recognition Based on FSMN. In: Huang, X., Jiang, J., Zhao, D., Feng, Y., Hong, Y. (eds) Natural Language Processing and Chinese Computing. NLPCC 2017. Lecture Notes in Computer Science, vol 10619. Springer, Cham. https://doi.org/10.1007/978-3-319-73618-1_21
DOI: https://doi.org/10.1007/978-3-319-73618-1_21
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-73617-4
Online ISBN: 978-3-319-73618-1
eBook Packages: Computer Science, Computer Science (R0)