A Conditional Generative Model for Speech Enhancement

Abstract

Deep learning-based speech enhancement approaches such as deep neural networks (DNNs) and long short-term memory (LSTM) networks have already demonstrated results superior to classical methods. However, these methods do not take full advantage of temporal context information: while DNNs and LSTMs consider temporal context in the noisy source speech, they do not do so for the estimated clean speech. Both DNNs and LSTMs also tend to over-smooth spectra, which causes the enhanced speech to sound muffled. This paper proposes a novel architecture that addresses both issues, which we term a conditional generative model (CGM). By applying an adversarial training scheme to a generator of deep dilated convolutional layers, the CGM is designed to model the joint and symmetric conditions of both noisy and estimated clean spectra. We evaluate the CGM against both DNN and LSTM baselines in terms of Perceptual Evaluation of Speech Quality (PESQ) and Short-Time Objective Intelligibility (STOI) on TIMIT sentences corrupted by ITU-T P.501 and NOISEX-92 noise in a range of matched and mismatched noise conditions. Results show that both the CGM architecture and the adversarial training mechanism yield significant improvements in PESQ and STOI in all tested noise conditions, and that both mitigate over-smoothing.
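To make the described architecture concrete, below is a minimal sketch of a dilated-convolution generator paired with a discriminator that scores (noisy, clean/enhanced) spectrum pairs, which is the conditional element the abstract refers to. It is written in PyTorch, whereas the paper cites MXNet [4] for its implementation; all layer widths, depths, the least-squares adversarial loss, and the L1 regression weight are illustrative assumptions, not the authors' configuration.

```python
# Hedged sketch of a conditional adversarial enhancement model; hyperparameters
# and losses are assumptions for illustration, not the paper's exact setup.
import torch
import torch.nn as nn


class Generator(nn.Module):
    """Maps noisy log-magnitude spectra (B, 1, F, T) to enhanced spectra."""

    def __init__(self, channels=64, dilations=(1, 2, 4, 8)):
        super().__init__()
        layers = [nn.Conv2d(1, channels, 3, padding=1), nn.ReLU()]
        for d in dilations:
            # Increasing dilation widens the receptive field (here in both
            # frequency and time, for simplicity) without pooling.
            layers += [nn.Conv2d(channels, channels, 3, padding=d, dilation=d),
                       nn.BatchNorm2d(channels), nn.ReLU()]
        layers += [nn.Conv2d(channels, 1, 3, padding=1)]
        self.net = nn.Sequential(*layers)

    def forward(self, noisy):
        return self.net(noisy)


class Discriminator(nn.Module):
    """Scores a (noisy, clean-or-enhanced) spectrum pair: the conditional part."""

    def __init__(self, channels=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, channels, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(channels, channels, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(channels, 1, 4, stride=2, padding=1))

    def forward(self, noisy, speech):
        return self.net(torch.cat([noisy, speech], dim=1))


def training_step(G, D, opt_g, opt_d, noisy, clean, l1_weight=100.0):
    # Discriminator update: push real pairs toward 1 and enhanced pairs toward 0
    # (a least-squares GAN objective, assumed here for stability).
    enhanced = G(noisy)
    d_loss = ((D(noisy, clean) - 1) ** 2).mean() + (D(noisy, enhanced.detach()) ** 2).mean()
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator update: fool the discriminator, plus an L1 regression term that
    # anchors the spectra; the adversarial term is what counteracts the
    # over-smoothing typical of plain MSE/L1 training.
    g_loss = ((D(noisy, enhanced) - 1) ** 2).mean() \
        + l1_weight * (enhanced - clean).abs().mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```

A training loop would sample batches of paired (noisy, clean) spectrograms and alternate these two updates; at test time only the generator is applied. For evaluation, off-the-shelf PESQ and STOI implementations (e.g., the pesq and pystoi Python packages) can score the enhanced output against the clean reference.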

Keywords

Deep learning · Speech enhancement · Generative model · Adversarial training

References

1. M. Arjovsky, S. Chintala, L. Bottou, Wasserstein GAN. arXiv preprint arXiv:1701.07875 (2017)
2. S. Boll, Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans. Acoust. Speech Signal Process. 27(2), 113–120 (1979)
3. J. Chen, D. Wang, Long short-term memory for speaker generalization in supervised speech separation. in Proceedings of the 17th Annual Conference of the International Speech Communication Association (2016), pp. 3314–3318
4. T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, Z. Zhang, MXNet: a flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274 (2015)
5. Z. Chen, S. Watanabe, H. Erdogan, J. Hershey, Integration of speech enhancement and recognition using long short-term memory recurrent neural network. in INTERSPEECH (2015)
6. I. Cohen, B. Berdugo, Speech enhancement for non-stationary noise environments. Signal Process. 81(11), 2403–2418 (2001)
7. J. Du, Y. Tu, L.R. Dai, C.H. Lee, A regression approach to single-channel speech separation via high-resolution deep neural networks. IEEE/ACM Trans. Audio Speech Lang. Process. (TASLP) 24(8), 1424–1437 (2016)
8. Y. Ephraim, D. Malah, Speech enhancement using a minimum mean-square error log-spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Process. 33(2), 443–445 (1985)
9. H. Erdogan, J.R. Hershey, S. Watanabe, J. Le Roux, Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks. in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (IEEE, 2015), pp. 708–712
10. T. Gao, J. Du, L.R. Dai, C.H. Lee, SNR-based progressive learning of deep neural network for speech enhancement. in INTERSPEECH (2016), pp. 3713–3717
11. J.S. Garofolo, L.F. Lamel, W.M. Fisher, J.G. Fiscus, D.S. Pallett, DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1. NASA STI/Recon Technical Report N 93 (1993)
12. I. Goodfellow, NIPS 2016 tutorial: generative adversarial networks. arXiv preprint arXiv:1701.00160 (2016)
13. I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets. in Advances in Neural Information Processing Systems (2014), pp. 2672–2680
14. K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition. in IEEE Conference on Computer Vision and Pattern Recognition (2016), pp. 770–778
15. S. Hochreiter, J. Schmidhuber, Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
16. S. Ioffe, C. Szegedy, Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015)
17. ITU-T, Test signals for use in telephonometry. ITU-T Recommendation P.501 (Aug. 1996)
18. D. Kingma, J. Ba, Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
19. A. Kumar, D. Florencio, Speech enhancement in multiple-noise conditions using deep neural networks. in INTERSPEECH (2016), pp. 3738–3742
20. V. Nair, G.E. Hinton, Rectified linear units improve restricted Boltzmann machines. in 2010 International Conference on Machine Learning (2010), pp. 807–814
21. S. Pascual, A. Bonafonte, J. Serrà, SEGAN: speech enhancement generative adversarial network. arXiv preprint arXiv:1703.09452 (2017)
22. A. Radford, L. Metz, S. Chintala, Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 (2015)
23. ITU-T, Pulse code modulation (PCM) of voice frequencies. ITU-T Recommendation G.711 (1988)
24. A. Rix, J. Beerends, M. Hollier, A. Hekstra, Perceptual evaluation of speech quality (PESQ), an objective method for end-to-end speech quality assessment of narrowband telephone networks and speech codecs. ITU-T Recommendation P.862 (2001)
25. L. Sun, J. Du, L.R. Dai, C.H. Lee, Multiple-target deep learning for LSTM-RNN based speech enhancement. in Hands-free Speech Communications and Microphone Arrays (HSCMA), 2017 (IEEE, 2017), pp. 136–140
26. C.H. Taal, R.C. Hendriks, R. Heusdens, J. Jensen, An algorithm for intelligibility prediction of time–frequency weighted noisy speech. IEEE Trans. Audio Speech Lang. Process. 19(7), 2125–2136 (2011)
27. T. Tieleman, G. Hinton, RMSProp: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, Tech. Rep. (2012)
28. T. Toda, A.W. Black, K. Tokuda, Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory. IEEE Trans. Audio Speech Lang. Process. 15(8), 2222–2235 (2007)
29. A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, K. Kavukcuoglu, WaveNet: a generative model for raw audio. CoRR abs/1609.03499 (2016)
30. A. van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves, et al., Conditional image generation with PixelCNN decoders. in Advances in Neural Information Processing Systems (2016), pp. 4790–4798
31. A. Varga, H.J. Steeneken, Assessment for automatic speech recognition: II. NOISEX-92: a database and an experiment to study the effect of additive noise on speech recognition systems. Speech Commun. 12(3), 247–251 (1993)
32. Q. Wang, J. Du, L.R. Dai, C.H. Lee, Joint noise and mask aware training for DNN-based speech enhancement with sub-band features. in Hands-free Speech Communications and Microphone Arrays (HSCMA), 2017 (IEEE, 2017), pp. 101–105
33. Y. Wang, J. Chen, D. Wang, Deep neural network based supervised speech segregation generalizes to novel noises through large-scale training. Dept. of Comput. Sci. and Eng., The Ohio State Univ., Columbus, OH, USA, Tech. Rep. OSU-CISRC-3/15-TR02 (2015)
34. Y. Wang, A. Narayanan, D. Wang, On training targets for supervised speech separation. IEEE/ACM Trans. Audio Speech Lang. Process. (TASLP) 22(12), 1849–1858 (2014)
35. F. Weninger, H. Erdogan, S. Watanabe, E. Vincent, J. Le Roux, J.R. Hershey, B. Schuller, Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR. in International Conference on Latent Variable Analysis and Signal Separation (Springer, 2015), pp. 91–99
36. F. Weninger, J.R. Hershey, J. Le Roux, B. Schuller, Discriminatively trained recurrent neural networks for single-channel speech separation. in 2014 IEEE Global Conference on Signal and Information Processing (GlobalSIP) (IEEE, 2014), pp. 577–581
37. D.S. Williamson, Y. Wang, D. Wang, Complex ratio masking for joint enhancement of magnitude and phase. in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (IEEE, 2016), pp. 5220–5224
38. Y. Xu, J. Du, L.R. Dai, C.H. Lee, An experimental study on speech enhancement based on deep neural networks. IEEE Signal Process. Lett. 21(1), 65–68 (2014)
39. Y. Xu, J. Du, L.R. Dai, C.H. Lee, Global variance equalization for improving deep neural network based speech enhancement. in 2014 IEEE China Summit & International Conference on Signal and Information Processing (ChinaSIP) (IEEE, 2014), pp. 71–75
40. Y. Xu, J. Du, L.R. Dai, C.H. Lee, A regression approach to speech enhancement based on deep neural networks. IEEE/ACM Trans. Audio Speech Lang. Process. (TASLP) 23(1), 7–19 (2015). https://doi.org/10.1109/TASLP.2014.2364452
41. Y. Xu, J. Du, Z. Huang, L.R. Dai, C.H. Lee, Multi-objective learning and mask-based post-processing for deep neural network based speech enhancement. arXiv preprint arXiv:1703.07172 (2017)
42. X.L. Zhang, D. Wang, Multi-resolution stacking for speech separation based on boosted DNN. in INTERSPEECH (2015), pp. 1745–1749
43. X.L. Zhang, D. Wang, A deep ensemble learning method for monaural speech separation. IEEE/ACM Trans. Audio Speech Lang. Process. (TASLP) 24(5), 967–977 (2016)
44. Y. Zhao, D. Wang, I. Merks, T. Zhang, DNN-based enhancement of noisy and reverberant speech. in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (IEEE, 2016), pp. 6525–6529

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

1. National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, Hefei, China
2. School of Computing, University of Kent, Medway, UK
