Research on speech separation technology based on deep learning
To address the instability of traditional speech separation algorithms, a reverberant speech separation model based on deep learning is proposed, and the problem of speech separation in reverberant environments is studied. Auditory scene analysis is used to simulate human auditory perception, and the target speech signal is extracted according to the ideal binary mask principle. Deep neural networks (DNNs) have shown strong learning ability in speech recognition and artificial intelligence. In this paper, a DNN model is proposed that learns dereverberation and denoising through the spectral mapping between "contaminated" speech and clean speech. A series of spectral features is extracted, and the temporal dynamics of adjacent frames are fused; the DNN transforms the encoded spectrum to restore the clean speech spectrum, and the time-domain signal is finally reconstructed. In addition, the feature-classification ability of the DNN is exploited to separate two-talker reverberant speech: the binaural features ITD and ILD are fused with the monaural feature GFCC to form a long feature vector, and the DNN is pre-trained with RBMs to complete the classification task. The results show that the proposed model improves the quality and intelligibility of the separated speech and significantly enhances the stability of the system.
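The ideal binary mask mentioned above keeps a time-frequency unit when the target dominates the interference and discards it otherwise. A minimal sketch follows; the local criterion `lc_db` and the toy 3x4 spectrogram are assumptions for illustration, not values from the paper.

```python
import numpy as np

def ideal_binary_mask(target_spec, interference_spec, lc_db=0.0):
    """Ideal binary mask: 1 where the local target-to-interference
    ratio (in dB) exceeds the criterion lc_db, else 0."""
    snr_db = 10.0 * np.log10(
        (np.abs(target_spec) ** 2 + 1e-12) /
        (np.abs(interference_spec) ** 2 + 1e-12)
    )
    return (snr_db > lc_db).astype(np.float32)

# toy magnitude spectrograms on a 3x4 time-frequency grid (hypothetical)
target = np.array([[2.0, 0.1, 1.5, 0.2],
                   [0.3, 2.5, 0.1, 1.8],
                   [1.2, 0.2, 2.2, 0.1]])
noise = np.full_like(target, 0.5)

mask = ideal_binary_mask(target, noise)
separated = mask * (target + noise)  # keep only target-dominated units
```

Applying the mask to the mixture zeroes the interference-dominated units, which is what makes the extracted target signal intelligible again.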
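The fusion of temporal dynamics from adjacent frames is typically done by splicing each frame with its neighbours before feeding the mapping DNN. A sketch of that splicing step, assuming hypothetical 257-bin log spectra and a context of ±2 frames (the paper does not specify these sizes):

```python
import numpy as np

def splice_frames(log_spec, context=2):
    """Stack each frame with its +/- context neighbours (edges padded
    by repetition), producing the fused DNN input per frame."""
    n_frames, _ = log_spec.shape
    padded = np.pad(log_spec, ((context, context), (0, 0)), mode="edge")
    return np.stack([padded[t:t + 2 * context + 1].ravel()
                     for t in range(n_frames)])

spec = np.random.randn(100, 257)   # hypothetical 100-frame log spectrogram
X = splice_frames(spec, context=2)  # each row fuses 5 adjacent frames
```

Each spliced row then maps, through the DNN, to the clean spectrum of its centre frame.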
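For the two-talker case, the binaural cues (ITD, ILD) and the monaural GFCCs are concatenated into one long feature vector per unit before classification. A minimal sketch, where the scalar cue values and the 64-dimensional GFCC size are illustrative assumptions:

```python
import numpy as np

def fuse_features(itd, ild, gfcc):
    """Concatenate binaural cues (ITD, ILD) with monaural GFCCs into
    one long feature vector, the input to the RBM-pretrained DNN."""
    return np.concatenate([np.atleast_1d(itd),
                           np.atleast_1d(ild),
                           np.ravel(gfcc)])

# hypothetical per-unit cues: scalar ITD (seconds), scalar ILD (dB),
# and a 64-dimensional GFCC vector
vec = fuse_features(3e-4, 4.2, np.zeros(64))
```

The classifier then assigns each unit to one of the two talkers, which plays the role of the binary mask in the two-talker setting.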
Keywords: Auditory scene analysis · Speech separation · Spectrum feature · Deep learning
The authors acknowledge the National Natural Science Foundation of China (Grant: 61372146, 61373098), the Youth Natural Science Foundation of Jiangsu Province of China (Grant: BK20160361), the Qinglan Project Young and Middle-aged Academic Leader Foundation of Jiangsu Province, the Professional Leader Advanced Research Project Foundation of Higher Vocational College of Jiangsu Province (Grant: 2017GRFX046).