Research on speech separation technology based on deep learning



To address the instability of traditional speech separation algorithms, a deep-learning-based model for separating reverberant speech is proposed, and the problem of speech separation in reverberant environments is studied. Auditory scene analysis is used to simulate human auditory perception, and the target speech signal is extracted according to the ideal binary mask (IBM) principle. Deep neural networks (DNNs) have also shown strong learning ability in speech recognition and artificial intelligence. In this paper, a DNN model is proposed that performs dereverberation and denoising by learning the spectral mapping between “contaminated” speech and clean speech. A series of spectral features is extracted, and the temporal dynamics of adjacent frames are fused. The DNN transforms the encoded spectrum and restores the clean speech spectrum, from which the time-domain signal is finally reconstructed. In addition, the feature classification ability of the DNN is used to separate “double sound” reverberant speech. The binaural features, interaural time difference (ITD) and interaural level difference (ILD), are fused with the monaural gammatone frequency cepstral coefficient (GFCC) features to form a long feature vector. The DNN is pre-trained with restricted Boltzmann machines (RBMs) to complete the classification task. The results show that the proposed model improves the quality and intelligibility of the separated speech and significantly enhances the stability of the system.
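The mask-based extraction and the fusion of adjacent frames described above can be illustrated with a short sketch. It assumes the masking principle is the standard ideal binary mask, computed from magnitude spectrograms of the target and the interference, and that frame fusion means concatenating each frame with its neighbours. The function names, the 0 dB local criterion, and the context sizes are hypothetical choices for illustration, not values taken from the paper.

```python
import numpy as np


def ideal_binary_mask(target_mag, interference_mag, lc_db=0.0):
    """Return 1.0 where the local target-to-interference ratio exceeds lc_db, else 0.0.

    Both inputs are magnitude spectrograms of identical shape.
    lc_db is an assumed local criterion in dB (hypothetical default).
    """
    eps = 1e-12  # avoids division by zero and log of zero
    ratio_db = 20.0 * np.log10((target_mag + eps) / (interference_mag + eps))
    return (ratio_db > lc_db).astype(np.float32)


def splice_context(features, left=2, right=2):
    """Concatenate each frame with its left/right neighbours.

    features has shape (frames, dims); the result has shape
    (frames, dims * (left + right + 1)). Edge frames are repeated as padding.
    """
    padded = np.pad(features, ((left, right), (0, 0)), mode="edge")
    n_frames = features.shape[0]
    windows = [padded[i:i + n_frames] for i in range(left + right + 1)]
    return np.concatenate(windows, axis=1)
```

In a mask-based system, the mask would be multiplied element-wise with the mixture spectrogram before resynthesis; in a mapping-based system, the spliced features would serve as the DNN input.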


Keywords: Auditory scene analysis; Speech separation; Spectrum feature; Deep learning



The authors acknowledge the National Natural Science Foundation of China (Grants: 61372146, 61373098), the Youth Natural Science Foundation of Jiangsu Province of China (Grant: BK20160361), the Qinglan Project Young and Middle-aged Academic Leader Foundation of Jiangsu Province, and the Professional Leader Advanced Research Project Foundation of Higher Vocational College of Jiangsu Province (Grant: 2017GRFX046).



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. College of Electronic and Information Engineering, Suzhou Vocational University, Suzhou, China
  2. School of Electronic and Information Engineering, Soochow University, Suzhou, China
  3. College of Electronics and Information Engineering, Suzhou Science and Technology University, Suzhou, China
