
A Conditional Generative Model for Speech Enhancement

Published in Circuits, Systems, and Signal Processing.

Abstract

Deep learning-based speech enhancement approaches such as deep neural networks (DNN) and Long Short-Term Memory (LSTM) have already demonstrated results superior to classical methods. However, these methods do not take full advantage of temporal context information: while DNN and LSTM consider temporal context in the noisy source speech, they do not do so for the estimated clean speech. Both DNN and LSTM also tend to over-smooth spectra, which causes the enhanced speech to sound muffled. This paper proposes a novel architecture that addresses both issues, which we term a conditional generative model (CGM). By applying an adversarial training scheme to a generator of deep dilated convolutional layers, CGM is designed to model the joint and symmetric conditions of both noisy and estimated clean spectra. We evaluate CGM against both DNN and LSTM in terms of Perceptual Evaluation of Speech Quality (PESQ) and Short-Time Objective Intelligibility (STOI) on TIMIT sentences corrupted by ITU-T P.501 and NOISEX-92 noise in a range of matched and mismatched noise conditions. Results show that both the CGM architecture and the adversarial training mechanism yield significant improvements in PESQ and STOI in all tested noise conditions, and that both mitigate over-smoothing.
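To make the architecture concrete, the following is a minimal, illustrative sketch in PyTorch of a CGM-style generator of dilated convolutional layers paired with a discriminator that scores (noisy, estimated-clean) spectrogram pairs. The paper's own implementation was built on MXNet [4]; the layer counts, channel widths, kernel sizes, and spectral dimensions below are assumptions for illustration, not the published hyperparameters.

import torch
import torch.nn as nn

class DilatedGenerator(nn.Module):
    """Stack of 1-D dilated convolutions mapping noisy log-spectra of
    shape (batch, freq_bins, frames) to estimated clean log-spectra."""
    def __init__(self, freq_bins=257, channels=64, dilations=(1, 2, 4, 8, 16)):
        super().__init__()
        layers = [nn.Conv1d(freq_bins, channels, kernel_size=3, padding=1), nn.ReLU()]
        for d in dilations:  # dilation grows per layer: wide temporal context
            layers += [nn.Conv1d(channels, channels, kernel_size=3,
                                 padding=d, dilation=d),
                       nn.BatchNorm1d(channels), nn.ReLU()]
        layers.append(nn.Conv1d(channels, freq_bins, kernel_size=1))
        self.net = nn.Sequential(*layers)

    def forward(self, noisy):
        return self.net(noisy)

class Discriminator(nn.Module):
    """Scores a (noisy, estimated-clean) spectrogram pair, so the
    adversarial game conditions jointly on both sources."""
    def __init__(self, freq_bins=257, channels=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(2 * freq_bins, channels, kernel_size=3, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(channels, 1),
        )

    def forward(self, noisy, estimate):
        # Concatenate along the frequency (channel) axis before scoring
        return self.net(torch.cat([noisy, estimate], dim=1))

In training, discriminator and generator updates would alternate, with the generator minimizing a regression loss (e.g., mean squared error to the clean log-spectra) plus a weighted adversarial term. It is the adversarial term that pushes estimated spectra toward the distribution of real clean spectra, which is what counters the over-smoothing noted in the abstract.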


Notes

  1. Noises were con_bin, met_mono, off_mono, car_mono, rai_mono, res_mono, train, and traffic from the ITU-T Recommendation P.501 database [17], and white, factory1, factory2, babble, and machinegun from NOISEX-92 [31], each mixed at \(\{-5, 0, 5, 10, 15, 20\}\) dB SNR (a mixing-and-scoring sketch follows these notes).

  2. Matched noises: white from the NOISEX-92 database, res_mono and con_bin from the ITU-T Recommendation P.501 database. Mismatched noises: destroyerops, f16, and m109 from the NOISEX-92 database.
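As referenced in note 1, the sketch below shows one way to construct the noisy mixtures and score enhanced output. It assumes 16 kHz mono signals held as NumPy arrays and uses the third-party pesq and pystoi packages (pip install pesq pystoi) as stand-ins for the PESQ [24] and STOI [26] measures; the authors' exact scoring tools are not specified here.

import numpy as np
from pesq import pesq    # wrapper around ITU-T P.862 PESQ
from pystoi import stoi  # STOI implementation

def mix_at_snr(speech, noise, snr_db):
    """Scale noise so that speech + noise has the requested SNR in dB."""
    noise = np.resize(noise, speech.shape)  # loop/trim the noise to length
    scale = np.sqrt(np.mean(speech ** 2) /
                    (np.mean(noise ** 2) * 10 ** (snr_db / 10.0)))
    return speech + scale * noise

def evaluate(clean, enhanced, fs=16000):
    """Score an enhanced signal against its clean reference."""
    return pesq(fs, clean, enhanced, 'wb'), stoi(clean, enhanced, fs)

# Usage with real signals (e.g., a TIMIT sentence and a NOISEX-92 noise):
#   for snr_db in (-5, 0, 5, 10, 15, 20):  # the SNR grid from note 1
#       noisy = mix_at_snr(speech, noise, snr_db)
#       print(snr_db, *evaluate(speech, noisy))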

References

  1. M. Arjovsky, S. Chintala, L. Bottou, Wasserstein GAN. arXiv preprint arXiv:1701.07875 (2017)

  2. S. Boll, Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans. Acoust. Speech Signal Process. 27(2), 113–120 (1979)

  3. J. Chen, D. Wang, Long short-term memory for speaker generalization in supervised speech separation. in INTERSPEECH (2016), pp. 3314–3318

  4. T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, Z. Zhang, MXNet: a flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274 (2015)

  5. Z. Chen, S. Watanabe, H. Erdogan, J. Hershey, Integration of speech enhancement and recognition using long-short term memory recurrent neural network. in INTERSPEECH (2015)

  6. I. Cohen, B. Berdugo, Speech enhancement for non-stationary noise environments. Signal Process. 81(11), 2403–2418 (2001)

  7. J. Du, Y. Tu, L.R. Dai, C.H. Lee, A regression approach to single-channel speech separation via high-resolution deep neural networks. IEEE/ACM Trans. Audio Speech Lang. Process. (TASLP) 24(8), 1424–1437 (2016)

  8. Y. Ephraim, D. Malah, Speech enhancement using a minimum mean-square error log-spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Process. 33(2), 443–445 (1985)

  9. H. Erdogan, J. R. Hershey, S. Watanabe, J. Le Roux, Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks. in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (IEEE, 2015), pp. 708–712

  10. T. Gao, J. Du, L.R. Dai, C.H. Lee, SNR-based progressive learning of deep neural network for speech enhancement. in INTERSPEECH (2016), pp. 3713–3717

  11. J.S. Garofolo, L.F. Lamel, W.M. Fisher, J.G. Fiscus, D.S. Pallett, DARPA TIMIT acoustic-phonetic continous speech corpus CD-ROM. NIST speech disc 1-1.1. NASA STI/Recon technical report n 93 (1993)

  12. I. Goodfellow, NIPS 2016 tutorial: generative adversarial networks. arXiv preprint arXiv:1701.00160 (2016)

  13. I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets. in Advances in Neural Information Processing Systems (2014), pp. 2672–2680

  14. K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition. in IEEE Conference on Computer Vision and Pattern Recognition (2016), pp. 770–778

  15. S. Hochreiter, J. Schmidhuber, Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)

  16. S. Ioffe, C. Szegedy, Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015)

  17. ITU: Test signals for use in telephonometry. ITU-T Recommendation P.501 (Aug. 1996)

  18. D. Kingma, J. Ba, Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)

  19. A. Kumar, D. Florencio, Speech enhancement in multiple-noise conditions using deep neural networks. in INTERSPEECH (2016), pp. 3738–3742

  20. V. Nair, G.E. Hinton, Rectified linear units improve restricted Boltzmann machines. in 2010 International Conference on Machine Learning (2010), pp. 807–814

  21. S. Pascual, A. Bonafonte, J. Serrà, SEGAN: speech enhancement generative adversarial network. arXiv preprint arXiv:1703.09452 (2017)

  22. A. Radford, L. Metz, S. Chintala, Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 (2015)

  23. CCITT, Pulse code modulation (PCM) of voice frequencies. ITU-T Recommendation G.711 (1988)

  24. A. Rix, J. Beerends, M. Hollier, A. Hekstra, Perceptual evaluation of speech quality (PESQ), an objective method for end-to-end speech quality assessment of narrowband telephone networks and speech codecs. ITU-T Recommendation 862 (2001)

  25. L. Sun, J. Du, L.R. Dai, C.H. Lee, Multiple-target deep learning for LSTM-RNN based speech enhancement. in Hands-free Speech Communications and Microphone Arrays (HSCMA), 2017 (IEEE, 2017), pp. 136–140

  26. C.H. Taal, R.C. Hendriks, R. Heusdens, J. Jensen, An algorithm for intelligibility prediction of time–frequency weighted noisy speech. IEEE Trans. Audio Speech Lang. Process. 19(7), 2125–2136 (2011)

  27. T. Tieleman, G. Hinton, RMSProp: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning (2012)

  28. T. Toda, A.W. Black, K. Tokuda, Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory. IEEE Trans. Audio Speech Lang. Process. 15(8), 2222–2235 (2007)

  29. A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, K. Kavukcuoglu, WaveNet: a generative model for raw audio. arXiv preprint arXiv:1609.03499 (2016)

  30. A. van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves, et al., Conditional image generation with PixelCNN decoders. in Advances in Neural Information Processing Systems (2016), pp. 4790–4798

  31. A. Varga, H.J. Steeneken, Assessment for automatic speech recognition: II. NOISEX-92: a database and an experiment to study the effect of additive noise on speech recognition systems. Speech Commun. 12(3), 247–251 (1993)

  32. Q. Wang, J. Du, L.R. Dai, C.H. Lee, Joint noise and mask aware training for DNN-based speech enhancement with sub-band features. in Hands-free Speech Communications and Microphone Arrays (HSCMA), 2017 (IEEE, 2017), pp. 101–105

  33. Y. Wang, J. Chen, D. Wang, Deep neural network based supervised speech segregation generalizes to novel noises through large-scale training. Dept. of Comput. Sci. and Eng., The Ohio State Univ., Columbus, OH, USA, Tech. Rep. OSU-CISRC-3/15-TR02 (2015)

  34. Y. Wang, A. Narayanan, D. Wang, On training targets for supervised speech separation. IEEE/ACM Trans. Audio Speech Lang. Process. (TASLP) 22(12), 1849–1858 (2014)

  35. F. Weninger, H. Erdogan, S. Watanabe, E. Vincent, J. Le Roux, J.R. Hershey, B. Schuller, Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR. in International Conference on Latent Variable Analysis and Signal Separation (Springer, 2015), pp. 91–99

  36. F. Weninger, J.R. Hershey, J. Le Roux, B. Schuller, Discriminatively trained recurrent neural networks for single-channel speech separation. in 2014 IEEE Global Conference on Signal and Information Processing (GlobalSIP) (IEEE, 2014), pp. 577–581

  37. D.S. Williamson, Y. Wang, D. Wang, Complex ratio masking for joint enhancement of magnitude and phase. in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (IEEE, 2016), pp. 5220–5224

  38. Y. Xu, J. Du, L.R. Dai, C.H. Lee, An experimental study on speech enhancement based on deep neural networks. IEEE Signal Process. Lett. 21(1), 65–68 (2014)

  39. Y. Xu, J. Du, L.R. Dai, C.H. Lee, Global variance equalization for improving deep neural network based speech enhancement. in 2014 IEEE China Summit & International Conference on Signal and Information Processing (ChinaSIP) (IEEE, 2014), pp. 71–75

  40. Y. Xu, J. Du, L.R. Dai, C.H. Lee, A regression approach to speech enhancement based on deep neural networks. IEEE/ACM Trans. Audio Speech Lang. Process. (TASLP) 23(1), 7–19 (2015). https://doi.org/10.1109/TASLP.2014.2364452

  41. Y. Xu, J. Du, Z. Huang, L.R. Dai, C.H. Lee, Multi-objective learning and mask-based post-processing for deep neural network based speech enhancement. arXiv preprint arXiv:1703.07172 (2017)

  42. X.L. Zhang, D. Wang, Multi-resolution stacking for speech separation based on boosted DNN. in INTERSPEECH (2015), pp. 1745–1749

  43. X.L. Zhang, D. Wang, A deep ensemble learning method for monaural speech separation. IEEE/ACM Trans. Audio Speech Lang. Process. (TASLP) 24(5), 967–977 (2016)

  44. Y. Zhao, D. Wang, I. Merks, T. Zhang, DNN-based enhancement of noisy and reverberant speech. in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (IEEE, 2016), pp. 6525–6529

Author information

Corresponding author: Zeng-Xi Li.

Additional information

This work was supported in part by the National Natural Science Foundation of China (Grant Nos. 61273264 and U1613211) and by the Science and Technology Department of Anhui Province (Grant No. 15CZZ02007).


About this article

Cite this article

Li, Z.-X., Dai, L.-R., Song, Y. et al. A Conditional Generative Model for Speech Enhancement. Circuits Syst. Signal Process. 37, 5005–5022 (2018). https://doi.org/10.1007/s00034-018-0798-4

