Abstract
Deep learning-based speech enhancement approaches such as deep neural networks (DNN) and long short-term memory (LSTM) networks have already demonstrated results superior to those of classical methods. However, these approaches do not take full advantage of temporal context information: while DNN and LSTM consider temporal context in the noisy source speech, they do not do so for the estimated clean speech. Both DNN and LSTM also tend to over-smooth spectra, which causes the enhanced speech to sound muffled. This paper proposes a novel architecture that addresses both issues, which we term a conditional generative model (CGM). By adopting an adversarial training scheme applied to a generator of deep dilated convolutional layers, CGM is designed to model the joint and symmetric conditions of both the noisy and the estimated clean spectra. We evaluate CGM against both DNN and LSTM in terms of Perceptual Evaluation of Speech Quality (PESQ) and Short-Time Objective Intelligibility (STOI) on TIMIT sentences corrupted by ITU-T P.501 and NOISEX-92 noise in a range of matched and mismatched noise conditions. Results show that both the CGM architecture and the adversarial training mechanism lead to better PESQ and STOI in all tested noise conditions. Beyond these improvements, CGM and adversarial training also mitigate over-smoothing.
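As a concrete illustration of the idea the abstract sketches, the following is a minimal, hypothetical rendering of a CGM-style setup: a generator built from stacked dilated convolutions (so the receptive field spans a wide temporal context) trained adversarially against a discriminator that scores joint (noisy, clean) spectrogram pairs. It is written in PyTorch purely for illustration (the references indicate the authors used MXNet), and every layer width, dilation rate, and loss weighting below is an assumed placeholder, not the paper's configuration.

```python
# A minimal sketch of a CGM-style enhancer, assuming log-power spectra
# shaped (batch, bins, frames). PyTorch is used for illustration only;
# all sizes and dilation rates are hypothetical, not the paper's values.
import torch
import torch.nn as nn
import torch.nn.functional as F

N_BINS = 257  # e.g. frequency bins of a 512-point STFT (assumed)

class Generator(nn.Module):
    """Maps noisy spectra to enhanced spectra via dilated convolutions."""
    def __init__(self, channels=64, dilations=(1, 2, 4, 8)):
        super().__init__()
        layers = [nn.Conv1d(N_BINS, channels, 3, padding=1), nn.ReLU()]
        for d in dilations:  # doubling dilation widens temporal context
            layers += [nn.Conv1d(channels, channels, 3, padding=d, dilation=d),
                       nn.ReLU()]
        layers += [nn.Conv1d(channels, N_BINS, 1)]
        self.net = nn.Sequential(*layers)

    def forward(self, noisy):
        return self.net(noisy)

class Discriminator(nn.Module):
    """Scores a joint (noisy, candidate-clean) pair of spectrograms."""
    def __init__(self, channels=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(2 * N_BINS, channels, 3, padding=1), nn.LeakyReLU(0.2),
            nn.Conv1d(channels, channels, 3, padding=1), nn.LeakyReLU(0.2),
            nn.Conv1d(channels, 1, 1),
        )

    def forward(self, noisy, candidate):
        joint = torch.cat([noisy, candidate], dim=1)  # condition on both
        return self.net(joint).mean(dim=(1, 2))       # one logit per example

def train_step(G, D, opt_g, opt_d, noisy, clean):
    ones = torch.ones(noisy.size(0))
    zeros = torch.zeros(noisy.size(0))
    # Discriminator step: real pairs -> 1, generated pairs -> 0.
    enhanced = G(noisy).detach()
    d_loss = (F.binary_cross_entropy_with_logits(D(noisy, clean), ones)
              + F.binary_cross_entropy_with_logits(D(noisy, enhanced), zeros))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()
    # Generator step: fool the discriminator while staying near the target.
    enhanced = G(noisy)
    g_loss = (F.binary_cross_entropy_with_logits(D(noisy, enhanced), ones)
              + F.mse_loss(enhanced, clean))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
```

In this sketch, each successive dilation doubles the temporal span a layer sees, so the enhanced output at a given frame conditions on a long window of the noisy input; the adversarial term, rather than a pure mean-squared-error target alone, is what pushes the generator away from over-smoothed spectra.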
Notes
Matched noises: white from the NOISEX-92 database; res_mono and con_bin from the ITU-T Recommendation P.501 database. Mismatched noises: destroyerops, f16 and m109 from the NOISEX-92 database.
References
M. Arjovsky, S. Chintala, L. Bottou, Wasserstein GAN. arXiv preprint arXiv:1701.07875 (2017)
S. Boll, Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans. Acoust. Speech Signal Process. 27(2), 113–120 (1979)
J. Chen, D. Wang, Long short-term memory for speaker generalization in supervised speech separation. in INTERSPEECH (2016), pp. 3314–3318
T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, Z. Zhang, MXNet: a flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274 (2015)
Z. Chen, S. Watanabe, H. Erdogan, J. Hershey, Integration of speech enhancement and recognition using long short-term memory recurrent neural network. in INTERSPEECH (2015)
I. Cohen, B. Berdugo, Speech enhancement for non-stationary noise environments. Signal Process. 81(11), 2403–2418 (2001)
J. Du, Y. Tu, L.R. Dai, C.H. Lee, A regression approach to single-channel speech separation via high-resolution deep neural networks. IEEE/ACM Trans. Audio Speech Lang. Process. (TASLP) 24(8), 1424–1437 (2016)
Y. Ephraim, D. Malah, Speech enhancement using a minimum mean-square error log-spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Process. 33(2), 443–445 (1985)
H. Erdogan, J. R. Hershey, S. Watanabe, J. Le Roux, Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks. in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (IEEE, 2015), pp. 708–712
T. Gao, J. Du, L.R. Dai, C.H. Lee, SNR-based progressive learning of deep neural network for speech enhancement. in INTERSPEECH (2016), pp. 3713–3717
J.S. Garofolo, L.F. Lamel, W.M. Fisher, J.G. Fiscus, D.S. Pallett, DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1. NASA STI/Recon technical report n 93 (1993)
I. Goodfellow, NIPS 2016 tutorial: generative adversarial networks. arXiv preprint arXiv:1701.00160 (2016)
I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets. in Advances in Neural Information Processing Systems (2014), pp. 2672–2680
K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition. in IEEE Conference on Computer Vision and Pattern Recognition (2016), pp. 770–778
S. Hochreiter, J. Schmidhuber, Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
S. Ioffe, C. Szegedy, Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015)
ITU: Test signals for use in telephonometry. ITU-T Recommendation P.501 (Aug. 1996)
D. Kingma, J. Ba, Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
A. Kumar, D. Florencio, Speech enhancement in multiple-noise conditions using deep neural networks. in INTERSPEECH (2016), pp. 3738–3742
V. Nair, G.E. Hinton, Rectified linear units improve restricted Boltzmann machines. in Proceedings of the 27th International Conference on Machine Learning (2010), pp. 807–814
S. Pascual, A. Bonafonte, J. Serrà, SEGAN: speech enhancement generative adversarial network. arXiv preprint arXiv:1703.09452 (2017)
A. Radford, L. Metz, S. Chintala, Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 (2015)
ITU-T, Pulse code modulation (PCM) of voice frequencies. ITU-T Recommendation G.711 (1988)
A. Rix, J. Beerends, M. Hollier, A. Hekstra, Perceptual evaluation of speech quality (PESQ), an objective method for end-to-end speech quality assessment of narrowband telephone networks and speech codecs. ITU-T Recommendation P.862 (2001)
L. Sun, J. Du, L.R. Dai, C.H. Lee, Multiple-target deep learning for LSTM-RNN based speech enhancement. in Hands-free Speech Communications and Microphone Arrays (HSCMA), 2017 (IEEE, 2017), pp. 136–140
C.H. Taal, R.C. Hendriks, R. Heusdens, J. Jensen, An algorithm for intelligibility prediction of time–frequency weighted noisy speech. IEEE/ACM Trans. Audio Speech Lang. Process. (TASLP) 19(7), 2125–2136 (2011)
T. Tieleman, G. Hinton, Lecture 6.5 - RMSProp: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning (2012)
T. Toda, A.W. Black, K. Tokuda, Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory. IEEE/ACM Trans. Audio Speech Lang. Process. (TASLP) 15(8), 2222–2235 (2007)
A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, K. Kavukcuoglu, WaveNet: a generative model for raw audio. CoRR abs/1609.03499 (2016)
A. van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves, et al., Conditional image generation with PixelCNN decoders. in Advances in Neural Information Processing Systems (2016), pp. 4790–4798
A. Varga, H.J. Steeneken, Assessment for automatic speech recognition: II. NOISEX-92: a database and an experiment to study the effect of additive noise on speech recognition systems. Speech Commun. 12(3), 247–251 (1993)
Q. Wang, J. Du, L.R. Dai, C.H. Lee, Joint noise and mask aware training for DNN-based speech enhancement with sub-band features. in Hands-free Speech Communications and Microphone Arrays (HSCMA), 2017 (IEEE, 2017), pp. 101–105
Y. Wang, J. Chen, D. Wang, Deep neural network based supervised speech segregation generalizes to novel noises through large-scale training. Dept. of Comput. Sci. and Eng., The Ohio State Univ., Columbus, OH, USA, Tech. Rep. OSU-CISRC-3/15-TR02 (2015)
Y. Wang, A. Narayanan, D. Wang, On training targets for supervised speech separation. IEEE/ACM Trans. Audio Speech Lang. Process. (TASLP) 22(12), 1849–1858 (2014)
F. Weninger, H. Erdogan, S. Watanabe, E. Vincent, J. Le Roux, J.R. Hershey, B. Schuller, Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR. in International Conference on Latent Variable Analysis and Signal Separation (Springer, 2015), pp. 91–99
F. Weninger, J.R. Hershey, J. Le Roux, B. Schuller, Discriminatively trained recurrent neural networks for single-channel speech separation. in 2014 IEEE Global Conference on Signal and Information Processing (GlobalSIP) (IEEE, 2014), pp. 577–581
D.S. Williamson, Y. Wang, D. Wang, Complex ratio masking for joint enhancement of magnitude and phase. in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (IEEE, 2016), pp. 5220–5224
Y. Xu, J. Du, L.R. Dai, C.H. Lee, An experimental study on speech enhancement based on deep neural networks. IEEE Signal Process. Lett. 21(1), 65–68 (2014)
Y. Xu, J. Du, L.R. Dai, C.H. Lee, Global variance equalization for improving deep neural network based speech enhancement. in 2014 IEEE China Summit & International Conference on Signal and Information Processing (ChinaSIP) (IEEE, 2014), pp. 71–75
Y. Xu, J. Du, L.R. Dai, C.H. Lee, A regression approach to speech enhancement based on deep neural networks. IEEE/ACM Trans. Audio Speech Lang. Process. (TASLP) 23(1), 7–19 (2015). https://doi.org/10.1109/TASLP.2014.2364452
Y. Xu, J. Du, Z. Huang, L.R. Dai, C.H. Lee, Multi-objective learning and mask-based post-processing for deep neural network based speech enhancement. arXiv preprint arXiv:1703.07172 (2017)
X.L. Zhang, D. Wang, Multi-resolution stacking for speech separation based on boosted DNN. in INTERSPEECH (2015), pp. 1745–1749
X.L. Zhang, D. Wang, A deep ensemble learning method for monaural speech separation. IEEE/ACM Trans. Audio Speech Lang. Process. (TASLP) 24(5), 967–977 (2016)
Y. Zhao, D. Wang, I. Merks, T. Zhang, DNN-based enhancement of noisy and reverberant speech. in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (IEEE, 2016), pp. 6525–6529
Additional information
This work was supported in part by the National Natural Science Foundation of China (Grant Nos. 61273264 and U1613211) and by the Science and Technology Department of Anhui Province (Grant No. 15CZZ02007).
Cite this article
Li, ZX., Dai, LR., Song, Y. et al. A Conditional Generative Model for Speech Enhancement. Circuits Syst Signal Process 37, 5005–5022 (2018). https://doi.org/10.1007/s00034-018-0798-4